(2) Clustering algorithms for use in the development of sets of
reference patterns for speaker-independent word recognition

(3) Automatic enrollment for speaker-dependent, connected-word
recognition for syntactically unconstrained word sequences of 20 words.

The  program  culminated  in  the  installation  of  the  speaker-independent, 
connected-digit  recognition  program  on  the  Base  and  Installation  Security 
System's  advanced  development  model  speaker  verification  system 
at  RADC. 

The speaker-independent, connected-digit recognition portion of
this study resulted in a significantly faster algorithm with a
50-percent decrease in error rate over the course of this study, from
9.5 percent to 4.7 percent, on an evaluation data set of 10 six-digit
sequences from 106 speakers (64 males, 42 females).

The development of the clustering algorithm resulted in a two-stage,
four-path algorithm with the mechanisms for detecting outlying data
points in the design data and with subsequent analysis routines for
comparing the results from the various paths and testing the validity
of resulting clusters on the basis of comparisons with a priori
information about the design data set.


The  research  into  development  of  an  automatic  enrollment  technique 
for  speaker-dependent  word  recognition  resulted  in  a method  that 
yielded  very  good  results  for  isolated  word  recognition  but  less 
acceptable  results  when  used  in  continuous  speech  from  the  same 
speaker.  The  better  results  achieved  with  comparable  hand  enrollments 
point  to  the  need  for  continued  development  of  automated  enrollment, 
and  the  desirability  of  an  interim  solution  of  a semiautomated 
enrollment  procedure  allowing  the  operator  the  option  of  modifying 
reference-point  locations  and  recognition-pattern  format  definitions 
defined  by  an  automated  front  end.  Independent  of  the  enrollment 
method,  however,  the  benefit  of  reference  file  updating  as  a means 
of accommodating contextual variability, as well as intersession variability,
was  abundantly  clear. 
EVALUATION


The  objective  of  this  program  was  to  develop  techniques 
and  algorithms  to  extend  highly  reliable  speaker-dependent  isolated 
word  recognition  to  speaker-dependent  continuous  word  recognition 
and  study  the  methodology  for  speaker-independent  continuous 
speech  recognition. 

A hardware/software  implementation  of  a real-time  continuous 
speech  recognition  system  was  fabricated  by  Texas  Instruments  (TI). 

This system was extensively tested, and it was upgraded continually
over the life of the contract to incorporate the results of the
tests. TI based its real-time speech recognition system on
techniques it developed for automatic speaker verification and
the Total Voice Verification program, which used a restricted
connected-digit capability.

The  speaker-independent,  connected  digit  performance  resulted 
in  95.3  percent  recognition  accuracy  on  a data  set  consisting  of 
10  six-digit  sequences  from  106  speakers  (64  males  and  42  females). 


The capability that TI has developed under the subject program
has been installed at RADC for further test and evaluation. The
RADC tests will attempt to establish how applicable the current
state-of-the-art connected speech system is to operational
military requirements.

RICHARD  S.  VONUSA 

Project Engineer
SECTION I
INTRODUCTION 


This  final  report  covers  the  research  done  on  a limited-vocabulary  continuous  word 
recognition  study  undertaken  by  Texas  Instruments.  This  effort  was  divided  into  two  primary 
areas  of  investigation:  extension  of  speaker-dependent  isolated  word  recognition  to  speaker- 
dependent  continuous  word  recognition,  and  the  study  of  speaker-independent  continuous  speech 
recognition. 

Speaker-dependent isolated word recognition is currently being used for applications such
as map data entry. Extension to speaker-dependent continuous word recognition is more
natural for the time normalization techniques used at Texas Instruments (described in
Section II) than for techniques that depend on locating the endpoints of words, which may not
even exist [e.g., when phonemes are shared between words in continuous speech, such as the /s/
in "six-seven"].

Speaker-dependent word recognition uses speaker-dependent reference patterns obtained in a
single enrollment session. A method of automatic enrollment, and supervised updating to
accommodate intersession variations and context dependencies, were investigated during this
study.

For  many  years,  the  approach  to  the  problem  of  speaker-independent  recognition  of 
continuous  speech  has  been  a heuristically  directed  search  for  the  correct  features  and  weightings 
for  the  hierarchical  classification  of  a set  of  symbol  strings,  mapping  ultimately  into  an 
English-language  transcription.  The  emphasis  has  been  on  getting  out  of  the  acoustic  and  into  the 
phonemic  domain  as  quickly  as  possible  because  of  the  huge  memory  requirements  for  storing 
acoustic  data  for  large  vocabularies.  Since  the  heuristics  were  often  based  on  the  researcher’s 
judgment,  derived  from  often  insufficient  data,  the  consequent  mislabelings  had  to  be  corrected 
with  progressively  more  complex  classification  algorithms.  Design  and  testing  using  small  data 
bases,  along  with  the  use  of  phonemic  representations  of  speech  have  resulted  not  only  from 
memory  limitations  but  also  from  the  lack  of  techniques  in  speech  for  dealing  with  very  large 
amounts  of  data.  Within  the  last  few  years,  however,  work  on  such  techniques  has  begun  to 
appear.  During  the  Total  Voice  Speaker  Verification  study,1  performed  by  Texas  Instruments 
under  RADC  sponsorship,  a clustering  algorithm  was  developed  and  used  to  produce  a set  of 
speaker-independent  reference  patterns  for  use  in  speaker-independent,  connected-digit 
recognition.  The  current  study  then  concentrated  on  two  tasks  for  speaker-independent, 
continuous-speech  recognition.  One  task  was  to  determine  the  performance  that  could  be 
achieved  on  speaker-independent,  connected-digit  recognition  using  the  previously  developed 
reference  patterns  by  making  improvements  to  the  word  recognition  algorithm.  The  other  task 
was  to  investigate  improvements  that  could  be  made  to  the  clustering  algorithm  for  the  purpose 
of  finding  better  partitions  of  the  design  data  set. 

Section II of this report reviews the speech technology used at Texas Instruments and
covers an improved directed graph searching algorithm developed during this contract. Section III
covers an automated enrollment method for speaker-dependent, connected-word recognition and
the role of reference-pattern updating. Section IV describes the application of clustering to
speaker-independent reference-pattern generation and covers the algorithm extensions developed
during this contract. Section V covers several general-purpose speech-processing capabilities that
center on the use of direct speech input/output (I/O) to a fast array processor. The experimental
results for both the extensive testing performed for speaker-independent, connected-digit
recognition and the more limited testing done for speaker-dependent, continuous-speech
recognition are covered in Section VI. Conclusions and recommendations are made in Section VII.


1 R.L. Davis, B.M. Hydrick, and G.R. Doddington, “Total Voice Speaker Verification,” Rome Air Development
Center Technical Report, RADC-TR-78-260, January 1978.
SECTION II

CONTINUOUS  SPEECH  RECOGNITION 

During  the  relatively  short  history  of  continuous  speech  recognition  work,  the 
classification  schemes  have  used  a feature  abstraction  process  from  the  speech  waveform  followed 
by  a hierarchical  classifier.  The  level  of  abstraction  varied  from  features  of  the  waveform  itself  to 
symbol  representations  (phonemes)  requiring  highly  sophisticated  classification  techniques  in 
order  to  compensate  for  segmentation  and  labeling  errors.  The  classification  complexity  generally 
was proportional to the level of abstraction. Martin2 shows a tree (Figure 1) of feature
abstraction levels.

The  usual  argument  for  using  a symbol  is  that  it  offers  a more  compact  representation  of 
words  and,  hence,  growth  in  the  memory  requirement  is  not  so  dramatic  with  increase  in 
vocabulary size. However, as Reddy3 points out, good signal-to-symbol transformation techniques
currently do not exist, causing size increases in the lexicons and the algorithms, not only to
account for context, dialect, and idiolect variations, but also to account for mislabeled acoustic
events.

Therefore,  reference-pattern  matching  in  the  signal  domain  has  the  advantage  of  not  having 
to  accommodate  feature  abstraction  errors.  Three  crucial  problems  are  involved,  however: 
selection  of  the  speech  representation,  time  normalization  of  the  speech  signal  for  matching  with 
reference  patterns,  and  selection  of  the  reference  patterns  themselves.  The  first  two  of  these 
topics  are  discussed  in  the  remainder  of  this  section  and  the  reference-pattern  selection  is  the 
subject  of  the  more  extended  discussions  in  Sections  III  and  IV. 

A.  SPEECH  REPRESENTATION 

The specific speech representation used in this study was the output of a 16-channel
digital filter bank preceded by a first-order differencing network (for preemphasis). Each of the
bandpass filters is a two-section, cascaded, second-order Bessel filter followed by a rectifier and a
lowpass filter sampled every 10 milliseconds (ms). Center frequencies and bandwidths for the 16
filters are given in Table 1.

An  important  consideration  is  the  choice  of  wide-bandwidth  filters  that  locate  spectral 
peaks  but  that  avoid  resolving  the  voice  fundamental  and  its  harmonics.  Note  that  the  center 
frequencies  for  filters  14  through  16  lie  in  the  part  of  the  spectrum  primarily  occupied  by 
energy  only  during  fricatives.  The  exception  is  the  third  formant  for  the  vowel  /i/  for  males  and 
for all the front vowels for females. Since no precise resolution of the frequency location is
possible  with  the  wide-bandwidth  filters,  the  only  interest  is  the  presence  or  absence  of  a third 
formant  (in  which  case  other  formants  would  also  exist  in  lower  filters)  or  the  presence  or  absence 
of  energy  anywhere  in  the  frequency  band  of  the  top  three  filters  without  lower  frequency  energies. 
In  order  to  compact  the  filter  bank  representation,  the  top  three  filters  were  added  into  one 
value,  without  averaging  because  of  the  depressed  amplitudes,  yielding  a 14-element  vector  to 
represent  the  speech  spectrum: 


2 T.B. Martin, “Acoustic Recognition of a Limited Vocabulary in Continuous Speech,” Ph.D. Dissertation,
University of Pennsylvania, 1970.

3 D.R. Reddy, “Speech Recognition by Machine: A Review,” Proceedings of the IEEE, 64:501-531,
April 1976.
Figure 1. Hierarchical Organization of Feature Abstraction Network (From Martin2)


This 14-element vector is regressed (Appendix A) using a sine and a cosine basis function to
eliminate gross aspects of the spectrum and to flatten the spectrum. The two regression
coefficients, c1 and c2, along with a measure of the energy in filters 2 through 13 (vowel
energy), are concatenated to the 14 regressed filter outputs. All elements except the energy are
then normalized and quantized to one of eight equiprobable values, resulting in a speech
representation such as that shown in Figure 2 for the word “seven.” The form shown in Figure 2
is used throughout the remainder of this report. The values of the normalized, quantized filter
outputs and the two regression coefficients are indicated by the density of the printed symbols
according to the following:


Value:    0      1   2   3   4   5   6   7

Symbol:   blank  ,   "   +   =   O   B   $

At  this  point  the  energy  is  not  quantized;  however,  it  is  always  used  relative  to  other  energies 
and  the  relative  value  is  then  quantized.  Further  detail  of  the  speech  representation  can  be  found 
in  the  Total  Voice  Speaker  Verification  study  final  report.1 
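
The channel reduction and quantization steps described above can be sketched in Python. This is a minimal illustration, not the report's implementation: the function names are hypothetical, the regression step (Appendix A) is omitted, and deriving the eight equiprobable levels from the empirical quantiles of a training sample is an assumption about how such levels might be obtained.

```python
from bisect import bisect_right

# Printed symbols for quantized values 0..7, as listed in the text.
SYMBOLS = [" ", ",", '"', "+", "=", "O", "B", "$"]


def merge_top_filters(frame16):
    """Reduce a 16-channel frame to 14 elements by summing channels 14-16.

    The three top channels are added, not averaged, as the text describes.
    """
    if len(frame16) != 16:
        raise ValueError("expected 16 filter outputs")
    return list(frame16[:13]) + [sum(frame16[13:])]


def equiprobable_thresholds(samples, levels=8):
    """Return levels-1 cut points that split `samples` into equally
    populated bins (assumed method for obtaining equiprobable levels)."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(i * n) // levels] for i in range(1, levels)]


def quantize(value, thresholds):
    """Map a value to an integer level in 0..len(thresholds)."""
    return bisect_right(thresholds, value)
```

A quantized value `q` can then be printed as `SYMBOLS[q]` to produce a density display of the kind the text describes.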


TABLE 1. CHARACTERISTICS OF 16-CHANNEL FILTER BANK

Filter    Center Frequency (Hz)    Bandwidth (Hz, at -6 dB)
  1               280                      250
  2               395                      280
  3               525                      310
  4               630                      340
  5               750                      360
  6               900                      360
  7              1080                      360
  8              1265                      365
  9              1480                      365
 10              1725                      365
 11              1985                      365
 12              2285                      360
 13              2640                      365
 14              3150                      625
 15              3720                      635
 16              4235                      615
Figure 2. Quantized Spectral Speech Representation

B. TIME NORMALIZATION

One  of  the  basic  problems  in  speech  processing  is  time  alignment  of  the  speech  waveform 
with  respect  to  a reference.  For  example,  in  the  two  spectrograms  in  Figure  3 for  the  word 
“seven,”  the  time  differences  between  corresponding  As  (which  denote  phonemic  boundaries)  are 
obvious. 

Early  work  used  linear  time  normalization  of  two  patterns  between  endpoints  of  words, 
and  although  this  method  improved  recognition  performance,  it  suffered  from  an  inability  to 
deal  with  the  nonlinear  fluctuations  between  endpoints  and  to  locate  endpoints  in  continuous 
speech  reliably. 

Two  distinct  approaches  developed  during  the  late  1960s  and  early  1970s.  One  approach 
(most  of  the  ARPA  sponsored  work:  Reddy3)  was  based  on  translating  a string  of  input  features 
into  a sequence  of  phonemic  labels,  a procedure  dependent  on  accurate  segmentation  between 
phonemes.  Segmentation  and  labeling  errors  were  then  repaired  by  more  sophisticated  subsequent 
processing  using  syntax,  semantics,  etc. 

The  other  method  approached  the  problem  by  a nonlinear  warping  of  the  time  axis  of  a 
feature  waveform  of  the  input  speech  to  obtain  maximum  coincidence  with  a reference 
waveform. This approach was used by both Doddington4 and Sakoe and Chiba;5 however, the
latter’s approach could be more easily represented in a form amenable to the use of dynamic
programming, useful in easing the computation burden. This dynamic programming approach has
been used by Velichko and Zagoruiko,6 Itakura,7 White and Neely,8 and Sakoe and Chiba9 in
isolated word recognition. Extension to continuous-word recognition has been done by Lowerre10
on the HARPY speech recognition system (also described briefly by White11), by Porter,12 and by
Nippon Electric Company in their DP-100 Connected Speech Recognition System.


4 G.R. Doddington, “A Method of Speaker Verification,” Ph.D. Dissertation (Thesis), University of
Wisconsin, 1970.

5 H. Sakoe and S. Chiba, “A Dynamic Programming Approach to Continuous Speech Recognition,”
Proceedings of the 7th International Congress on Acoustics, August 1971.

Figure 3. Demonstration of Need for Time Alignment Between Spectra for Two “Sevens”

Figure 4. Relationship of Piecewise Linear Time Normalization to Nonlinear Time Normalization

The  technique  used  at  Texas  Instruments  is  an  amalgamation  of  these  two  methods  and 
was  first  used  by  Doddington13  in  1973.  In  this  method,  potential  acoustic  boundaries 
(reference  points)  are  first  located  in  the  input  waveform.  Reference  points  are  combined  into 
optimal  sequences  for  words  in  the  vocabulary  using  a dynamic  programming  routine  that  uses  a 
measure  of  how  reliably  the  reference  points  were  located  and  that  accounts  for  the  deviations 
from  expected  time  differences  between  reference  points. 

After  sequences  of  potential  reference  points  have  been  identified,  the  input  waveforms  are 
interpolated  linearly  between  reference  points  to  form  a time  normalized  representation  of  the 
utterance.  The  relationship  to  the  Sakoe/Chiba  approach  can  be  seen  in  Figure  4.  Essentially, 
only  those  points  along  the  path  of  the  time  warp  that  represent  acoustic  boundaries  are  found 
(the  o's  in  Figure  4),  and  the  linear  interpolation  is  then  performed  between  these  reference 
points.  A piecewise  linear,  time-normalized,  acoustic  representation  of  the  word  (a  “recognition 
pattern”)  is  thus  formed.  A sample  of  the  spectral  data  portion  of  a recognition  pattern  being 
extracted  from  input  speech  spectra  is  shown  in  Figure  5. 
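
The interpolation step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each interval is sampled at equally spaced (possibly fractional) frame times starting at its left reference point, and one closing column is taken at the last reference point. The extrapolation columns before the first and after the last reference point are not modeled here.

```python
def interpolate_frame(frames, t):
    """Linearly interpolate feature vectors at fractional frame time t."""
    lo = int(t)
    hi = min(lo + 1, len(frames) - 1)
    frac = t - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])]


def recognition_pattern(frames, ref_points, columns_between):
    """Build a time-normalized pattern between consecutive reference points.

    frames          -- list of feature vectors, one per 10-ms frame
    ref_points      -- increasing frame indices of the located reference points
    columns_between -- number of output columns for each interval
                       (one entry per consecutive pair of reference points)
    """
    if len(columns_between) != len(ref_points) - 1:
        raise ValueError("need one column count per reference-point pair")
    pattern = []
    intervals = zip(ref_points, ref_points[1:])
    for (start, end), ncols in zip(intervals, columns_between):
        step = (end - start) / ncols
        # Sample ncols equally spaced columns from the left reference point.
        for c in range(ncols):
            pattern.append(interpolate_frame(frames, start + c * step))
    # Close the pattern with the column at the final reference point
    # (an assumption of this sketch).
    pattern.append(interpolate_frame(frames, ref_points[-1]))
    return pattern
```

Because the number of columns per interval is fixed for each word, two utterances of different durations yield patterns of the same size, which is what allows a direct column-by-column comparison with a reference pattern.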

As  an  example  of  the  choice  of  reference-point  locations,  the  reference  points  (As)  for  the 
digits  are  shown  in  Table  2 for  the  phonetic  transcriptions  of  the  general  American  dialect 
pronunciations  for  the  digits  as  found  in  Kenyon  and  Knott.14  These  locations  were  chosen  at 
points  that  would  exhibit  large  spectral  changes.  The  actual  rules  used  in  extracting  recognition 
patterns for the 10 digits are also specified in Table 2, where:

(1) Initial negative numbers indicate the columns for extrapolation before the first
reference point

(2)  Intermediate  numbers  in  parentheses  indicate  the  number  of  columns  for 
interpolation  between  reference  points 

(3)  The  remaining  numbers  indicate  columns  for  extrapolation  after  the  last  reference 
point. 


6 V.M. Velichko and N.G. Zagoruiko, “Automatic Recognition of 200 Words,” International Journal
Man-Machine Studies, 2:223, June 1970.

7 F. Itakura, “Minimum Prediction Residual Principle Applied to Speech Recognition,” IEEE Transactions on
Acoustics, Speech and Signal Processing, ASSP-23:67-72, February 1975.

8 G.M. White and R.B. Neely, “Speech Recognition Experiments With Linear Prediction, Bandpass Filtering,
and Dynamic Programming,” IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24:183-188,
April 1976.

9 H. Sakoe and S. Chiba, “Dynamic Programming Algorithm Optimization for Spoken Word Recognition,”
IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-26:43-49, February 1978.

10 B.T. Lowerre, “The HARPY Speech Recognition System,” Ph.D. Dissertation (Thesis), Carnegie-Mellon
University, 1976.

11 G.M. White, “Continuous Speech Recognition: Dynamic Programming, Knowledge Nets and HARPY,”
Paper 28-2, 1978 WESCON Professional Program, September 1978.

12 J.E. Porter, “LISTEN: A System for Recognizing Connected Speech Over Small, Fixed Vocabularies in Real
Time,” Naval Training Equipment Center Technical Report, NAVTRAEQUIPCEN 77-C-0096-1, April 1978.

13 G.R. Doddington, “Speaker Verification,” Rome Air Development Center Technical Report, RADC-TR-
74-179, April 1974.

14 J.S. Kenyon and T.A. Knott, A Pronouncing Dictionary of American English, G. & C. Merriam Company
(Springfield, Massachusetts, 1953).
Figure 5. Example of Recognition Pattern Formation


At  this  point,  the  speech  representation  is  still  in  the  acoustic  domain,  differing  from 
those10,15  who  transform  their  time-warped  segments  into  phonemic  labels  with  associated 
transition  probabilities  between  labeled  states.  The  advantage  of  remaining  in  the  acoustic 
domain  is  that  it  avoids  an  intermediate  classification  that  would  introduce  errors  and  obviates 
the  need  to  find  every  phonetic  boundary,  which  is  helpful  when  such  boundaries  are  difficult  to 
find. 


15 F. Jelinek, “Continuous Speech Recognition by Statistical Methods,” Proceedings of the IEEE, 64:532-556,
April 1976.
TABLE 2. RECOGNITION PATTERN FORMAT DEFINITIONS
FOR THE DIGITS

Digit    Transcription    Reference points    Pattern format
zero     z ɪ r o                 3            -4, -2 (4) (4) [illegible]
one      w ʌ n                   2            -4, -2 (6) 2, 4
two      t u                     3            -4, -2 (2) (6) [illegible]
three    θ r i                   2            -4, -2 (6) [illegible]
four     f o r                   2            -4, -2 (6) 2
five     f a ɪ v                 2            -4, -2 (6) 2, 4
six      s ɪ k s                 3            -4, -2 (4) (2) 2, 4
seven    s ɛ v ə n               3            -4, -2 (4) (4) 2, 4
eight    e t                     2            [illegible] (6) 2, 4
nine     n a ɪ n                 2            -4, -2 (6) 2, 4


C.  REFERENCE-POINT  LOCATION 

A presupposition  of  this  piecewise,  linear,  time-normalization  technique  is  extremely 
accurate  reference-point  location.  One  approach  would  be  vocabulary-independent,  locating 
changes  in  features  such  as  voicing,  energy,  or  spectrum  between  adjacent  time  samples.  This  is  a 
reliable,  precise  method  for  use  in  speaker-dependent  recognizers;  however,  sometimes  expected 
acoustic  segmentation  points  are  missed  in  speaker-independent  recognition. 

A more  robust  approach  is  to  use  a vocabulary-dependent  approach  (similar  to  the 
“transeme”  approach  used  at  IBM16),  matching  a feature  vector  (called  a “scanning  pattern”) 
extracted  from  the  input  speech  waveform  to  reference  scanning  patterns,  or  templates.  Figure  6 
shows  a scanning  pattern  being  extracted  from  the  spectral  input.  Matching  is  performed  by 
computing  a distance  between  the  input  and  all  reference  patterns  for  every  frame  (10  ms  in  this 
study).  Minima  in  this  distance  function  are  locations  of  potential  acoustic  boundaries  (reference 
points). 

More specifically, the scanning pattern formed at time t_j consists of: (1) the spectral data,
regression coefficients, and energy for the five time samples from t_{j-2} through t_{j+2} and (2) the
difference between the data for all adjacent pairs of time samples. The energy used in the
scanning pattern is the energy for each of the five columns of data, normalized by the sum over
all five columns and quantized to 4 bits. Figure 6 illustrates the formation of a scanning pattern
from  preprocessed  (regressed,  normalized,  quantized)  speech  data.  The  only  purpose  of  the 
difference  data  is  to  weight  more  heavily  rapid  changes  of  the  feature  vectors  with  respect  to 
time.  Since  these  data  are  derived  from  the  standard  data  portion  of  the  scanning  pattern, 
subsequent  illustrations  of  scanning  patterns  in  this  report  will  not  show  the  difference  data, 
even  though  it  is,  in  fact,  part  of  the  actual  pattern. 
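
A scanning pattern of this kind could be assembled as in the sketch below. The function name and data layout are hypothetical. Two details are assumptions made for illustration: the 4-bit energy quantization is taken as a uniform scaling of the relative energy onto 0..15, and the difference data are computed over every element of the column, including the energy element.

```python
def scanning_pattern(features, energies, j):
    """Form the scanning pattern centered on frame j.

    features -- per-frame lists of quantized spectral values and regression
                coefficients (energy excluded)
    energies -- per-frame unquantized energy values
    """
    if j - 2 < 0 or j + 2 >= len(features):
        raise IndexError("need two frames of context on each side of j")
    window = range(j - 2, j + 3)

    # Energy of each column relative to the five-column total, on 4 bits.
    total = sum(energies[t] for t in window)
    columns = []
    for t in window:
        rel = energies[t] / total if total > 0 else 0.0
        q_energy = min(15, int(rel * 16))  # assumed uniform 4-bit quantizer
        columns.append(list(features[t]) + [q_energy])

    # Standard data: the five columns laid end to end.
    pattern = [v for col in columns for v in col]
    # Difference data: element-wise differences of adjacent columns.
    for prev, cur in zip(columns, columns[1:]):
        pattern.extend(c - p for p, c in zip(prev, cur))
    return pattern
```

Appending the differences to the pattern, rather than keeping them separate, means a single squared-difference comparison automatically gives extra weight to rapid changes over time.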


16 N.R. Dixon and H.F. Silverman, “The 1976 Modular Acoustic Processor (MAP),” IEEE Transactions on
Acoustics, Speech and Signal Processing, ASSP-25:367-379, October 1977.

Figure 6. Example of Scanning Pattern Formation

In order to determine where reference points occur in the input speech, the input data are
compared with reference data. This procedure (called scanning) is done by formatting scanning
patterns  from  the  input  speech  at  each  time  sample  t,  comparing  these  with  predetermined 
reference  scanning  patterns  Tk,  and  obtaining  a measure  of  squared  difference  between  the  two, 
called  the  scanning  error: 


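$$E_k(t) = \sum_{i=1}^{164} \bigl(S_i(t) - T_{k,i}\bigr)^2 \qquad (1)$$

where $S_i(t)$ denotes the $i$th element of the input scanning pattern formed at time $t$ and
$T_{k,i}$ the $i$th element of reference scanning pattern $T_k$.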

The  final  error  associated  with  each  reference  point  is  the  minimum  error  of  all  comparisons 
with  patterns  representing  that  reference  point. 
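
As a minimal sketch (function names are hypothetical), the scanning error and the per-reference-point minimum over several templates can be written as:

```python
def scanning_error(input_pattern, reference_pattern):
    """Sum of squared element-wise differences between two scanning patterns."""
    if len(input_pattern) != len(reference_pattern):
        raise ValueError("patterns must have the same length")
    return sum((s - r) ** 2 for s, r in zip(input_pattern, reference_pattern))


def reference_point_error(input_pattern, reference_patterns):
    """Minimum scanning error over all templates for one reference point."""
    return min(scanning_error(input_pattern, ref) for ref in reference_patterns)
```

Evaluating `reference_point_error` at every frame yields the error function, one per reference point, that is searched for valleys.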

Using  the  scanning  errors  as  a function  of  time,  an  error  function  is  thus  generated  for 
each  type  of  reference  scanning  pattern  using  the  minimum  scanning  error  for  each  pattern  type 
for  each  time  sample.  (Multiple  reference  scanning  patterns  may  be  allowed  for  each  reference 
point  of  each  word.)  Each  function  is  monitored  for  dips  of  sufficient  magnitude  to  be 
considered  as  potential  locations  of  the  corresponding  reference  points  in  the  input  data.  These 
dips  are  called  valley  points  when  the  ratio  of  the  scanning  error  following  the  dip  to  the 
scanning  error  at  the  dip  itself  is  greater  than  or  equal  to  a specified  peak-to-valley  ratio  (PVR), 
typically  1.1  to  1.3,  and  the  magnitude  of  the  scanning  error  for  the  valley  point  is  less  than  or 
equal  to  a threshold,  typically  600  to  1,200.  The  occurrence  of  a peak  (verified  when  the  ratio 
of  the  scanning  error  following  the  peak  to  the  scanning  error  at  the  peak  is  less  than  the 
reciprocal  of  the  PVR)  is  required  before  another  valley  point  can  be  found.  The  valley-finding 
procedure  is  shown  in  Figure  7. 
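
The valley-finding rule can be read as a two-state procedure that alternates between looking for a valley and looking for a peak. The sketch below is one possible interpretation, not the report's code: the function name and default parameter values are illustrative, and "the scanning error following the dip" is taken to mean the error at any later frame compared against the running minimum.

```python
def find_valleys(errors, pvr=1.2, threshold=1000):
    """Return frame indices of valley points in a scanning-error function."""
    valleys = []
    seeking_valley = True
    best = 0  # index of the running minimum (or running maximum)
    for t in range(1, len(errors)):
        if seeking_valley:
            if errors[t] < errors[best]:
                best = t  # deeper dip found
            elif (errors[t] > errors[best]
                  and errors[t] >= pvr * errors[best]
                  and errors[best] <= threshold):
                # Error has risen by the peak-to-valley ratio after a dip
                # that lies at or below the threshold: accept the dip.
                valleys.append(best)
                seeking_valley = False
                best = t  # start tracking the following peak
        else:
            if errors[t] > errors[best]:
                best = t  # higher peak found
            elif errors[t] * pvr < errors[best]:
                # Error has fallen below peak / PVR: the peak is verified.
                seeking_valley = True
                best = t
    return valleys
```

Requiring a verified peak between valleys prevents small ripples near the bottom of a single dip from being reported as several reference-point candidates.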

D.  WORD  HYPOTHESIZING  AND  TESTING 


Once  these  valleys  in  the  scanning  error  (potential  reference  points)  have  been  found,  the 
next  task  is  to  fit  them  together  to  form  word  hypotheses.  A sequence  of  time-ordered 
reference-point  hypotheses  for  a word  must  exist,  and  the  time  distance  between  each  pair  of 
reference  points  must  satisfy  word-specific  minimum/maximum  restrictions.  The  error  determined 
for  each  reference  point  pair  is  weighted  by  deviations  from  the  expected  distance  between  the 
two points and the scanning error at each hypothesized reference point. The weighted error for
reference points i and j is:


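$$Ew_{ij} = \frac{(e_i + \text{offset})\,(e_j + \text{offset})}{1024}\,\bigl[\,1 + \theta\,(\cdots)\,\bigr] \qquad (2)$$

where

    dt_ij = t_j - t_i
    expected dt_ij = the expected time difference between reference points i and j
    dt_ij = max(dt_ij, dt_min)
    dt_min = [value illegible]
    θ = 2
    offset = 100
    e_i, e_j = scanning error for reference points i, j

The bracketed factor weights the deviation of dt_ij from its expected value; the exact form of
the term in parentheses is illegible in this copy.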
Figure 7. Example of Valley Finding


If the hypothesized word sequence error (SQ), which is the sum of the Ew_ij for all reference
point pairs in the word, is less than a predetermined word-specific threshold, then a word has
been hypothesized.

To  test  this  hypothesized  word,  the  time-normalized  recognition  pattern  anchored  at  the 
corresponding  hypothesized  reference  points  is  compared  to  time-normalized  reference 
recognition pattern(s) for the hypothesized word, using the squared Euclidean distance. This
distance  (or  the  minimum  distance,  in  the  case  of  multiple  reference  patterns)  is  the  recognition 
error  (TE),  which  is  used  along  with  the  sequence  error  (SQ)  in  computing  a total  normalized 
error  (NE)  for  the  word  “k”  as  given  below: 

$$NE_k = \frac{TE_k / (\text{No. of columns in word } k)}{\text{normalizing constant for word } k} + w_k\,\frac{SQ_k}{10\,(NPP)} \qquad (3)$$

where NPP is the number of reference-point pairs used in computing SQ_k. If the NE for the
hypothesized word is above a predefined threshold or if the average energy across the recognition
pattern is less than a threshold, the word is discarded. Otherwise, it is placed into a table of
hypothesized words, along with the SQ, TE, NE, average energy, reference-point times, and
scanning errors for that word. Table 3 is an example table of hypothesized words for a sequence
of six digits spoken continuously. Note the existence of the 35 superfluous words in this case.
The optimal searching of this table (or directed graph) is described in the next subsection.

E.  AN  EFFICIENT  TREE-SEARCHING  ALGORITHM 

Once the current speech segment has been completed, the sorted table of hypothesized
words must be searched to find the sequence having the minimum error. Note that this
corresponds to finding the best path through a directed graph, such as the simple one shown in
Figure 8, taken from Porter.12 The correspondence to the directed graph can be seen most
readily from a specific example. Table 4 shows the table of hypothesized words (digits, in this
case) sorted according to the time (in centiseconds) of the final reference points. Each entry in
this table can point back in time to all previous digits whose final reference-point times are
earlier than the initial reference-point time of the entry being considered. The allowable range
of backpointers can be limited by requiring that the time difference between the first reference
point of each word and the last reference point of preceding words lie between a specified
minimum and maximum.

Clearly, the exhaustive search time for traversing all possible paths increases rapidly.


TABLE  3.  SAMPLE  INFORMATION  CONTAINED  IN  TABLE  OF  HYPOTHESIZED  DIGITS 
TABLE 4. SAMPLE OF TABLE OF BACKPOINTERS FOR BEST WORD
SEQUENCES IN SORTED TABLE OF HYPOTHESIZED WORDS

Syntactic constraints, such as those used in the Total Voice Speaker Verification study, can
sometimes aid the efficiency of sequence finding, depending on whether the constraints apply
locally to word pairs or more globally to the utterance as a whole. If syntactic constraints have
been imposed in order to increase sequence recognition performance, such as in the Total Voice
application, these constraints can be incorporated into the tree-searching algorithm to eliminate
searching branches that are not syntactically correct.

An  even  greater  potential  saving  may  be  obtained  by  saving  optimal  subsequences  to  avoid 
repetitious  searching  of  the  same  path.  However,  this  technique  is  not  acceptable  for  sequences 
that are syntactically constrained since the saved optimal subsequence may not, in fact, satisfy
the constraints when those constraints are applied to the entire sequence. In other words, the
“correct”  sequence  (satisfying  the  syntactic  constraints)  may  in  fact  not  be  the  “best”  (lowest 
error)  sequence.  Since  the  saving  of  optimal  subsequences  only  finds  one  best  sequence  of  a 
given  length,  this  method  is  not  appropriate  to  syntactically  constrained  word  sequences. 

For unconstrained sequences, however, this technique of saving optimal subsequences can
shorten the exhaustive search time to be proportional only to the number of table entries
preceding a table entry. As an example, Table 4 gives the resulting subsequence backpointers for
the sorted table of hypothesized digits for the six-digit sequence in Table 3. These backpointer
lists are constructed from the bottom up, according to the algorithm shown in Figure 9.
Essentially, the backpointer for sorted table index i, sequence length k, points to the sorted table
index j (j < i) that has the minimum error for a subsequence of length (k - 1) for all j satisfying
the relation:

$$\Delta t_{\min} \le t_{\text{initial},i} - t_{\text{final},j} \le \Delta t_{\max}$$

where $\Delta t_{\min}$ = 3 cs and $\Delta t_{\max}$ = 120 cs were used in this study.

As each new entry in the list of backpointers is constructed, it is compared to the best
(lowest total error) subsequence of the same length. If the newer entry has a lower error, this
error replaces the best error, and the pointer to the best sequence of that length is changed to
point to the sorted table entry currently under consideration. After the final sorted table index
has been completed, the array of pointers to the subsequences having the lowest error contains
the optimal results of the search. If the length of the sequence has been constrained, all that is
necessary is to select the backpointer for the specified length sequence. If not, the second half of
the algorithm shown in Figure 9 is used to determine the best sequence out of those specified by
the array of pointers to the best sequence.
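The backpointer construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program of Figure 9: the `Hypothesis` record, the function name `best_sequences`, and the returned dictionary are hypothetical, and the final selection among sequences of different lengths (the second half of Figure 9) is left to the caller.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str      # hypothesized word (a digit here)
    t_first: int    # time of first reference point, in centiseconds
    t_last: int     # time of last reference point, in centiseconds
    error: float    # total normalized error for the word


def best_sequences(table, max_len, dt_min=3, dt_max=120):
    """Return {length: (total_error, labels)} for the best sequence of each length."""
    table = sorted(table, key=lambda h: h.t_last)   # sort by final reference point
    n = len(table)
    # cost[i][k]: lowest total error of a length-k sequence ending at entry i
    cost = [[math.inf] * (max_len + 1) for _ in range(n)]
    back = [[None] * (max_len + 1) for _ in range(n)]
    best_end = [None] * (max_len + 1)               # best final entry per length

    for i, h in enumerate(table):
        cost[i][1] = h.error
        for k in range(2, max_len + 1):
            for j in range(i):                      # only earlier table entries
                gap = h.t_first - table[j].t_last
                if dt_min <= gap <= dt_max:
                    cand = cost[j][k - 1] + h.error
                    if cand < cost[i][k]:
                        cost[i][k] = cand
                        back[i][k] = j
        for k in range(1, max_len + 1):
            if cost[i][k] < math.inf and (
                best_end[k] is None or cost[i][k] < cost[best_end[k]][k]
            ):
                best_end[k] = i

    results = {}
    for k in range(1, max_len + 1):
        i = best_end[k]
        if i is None:
            continue                                # no sequence of this length
        total, labels, length = cost[i][k], [], k
        while i is not None:                        # follow backpointers
            labels.append(table[i].label)
            i = back[i][length]
            length -= 1
        results[k] = (total, labels[::-1])
    return results
```

Each table entry is visited once, and for each entry only the preceding entries are examined, which is the saving over enumerating every path separately.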

Although  the  algorithm  described  in  this  subsection  was  developed  as  a natural  extension 
to  existing  sequence  finders  that  did  not  save  optimal  subsequences,  reference  should  be  made  to 
the  work  of  others  on  this  problem.  Most  widely  published  is  the  work  of  Jelinek  et  al.  at 
IBM.15,17,18  The  IBM  work  was  predicated  on  a phonemic  representation  of  the  recognized 
speech.  The  descriptions  were  in  terms  of  probabilistic  finite  state  machines  where  the  recognized 
phonemes  are  outputs  of  state-to-state  transitions,  with  all  state  transitions  having  associated 
a priori  probabilities. 


17 F. Jelinek, L.R. Bahl, and R.L. Mercer, “Design of a Linguistic Statistical Decoder for the Recognition of
Continuous Speech,” IEEE Transactions on Information Theory, IT-21:250-256, May 1975.

18 L.R. Bahl and F. Jelinek, “Decoding for Channels With Insertions, Deletions, and Substitutions With
Applications to Speech Recognition,” IEEE Transactions on Information Theory, IT-21:404-411, July 1975.

Figure 9. Flow Chart for Efficient Tree-Searching Algorithm


Much closer to the work of this subsection is that of Porter,12 whose MINT algorithm is
used to find the highest probability path through a directed graph of hypothesized words, such
as that shown in Figure 8. The nodes (rather than the transitions, as in the IBM work) are each
associated with a hypothesized word and the edges represent backpointers to all allowable
predecessors. The solution to the real-time processing is almost the same as that described here;
each node need be processed only once, rather than each sequence once, by saving backpointers
to the optimal subsequence. The only difference is that, in the present study, N optimal
subsequences are saved (N is the maximum allowable length sequence), whereas just one
backpointer is saved by Porter. Subsequent postprocessing in the present study then selects
which of the final N sequences is the best.

The  interested  reader  is  referred  to  Porter12  for  an  extended  discussion  of  the  probabilistic 
basis  for  this  procedure. 

F.  OVERALL  WORD  RECOGNITION  ALGORITHM 

Word  recognition  at  Texas  Instruments  is  currently  based  on  the  piecewise-linear 
time-normalization  technique  (Subsection  II. B)  of  finding  potential  acoustic  boundaries 
(reference  points)  and  fitting  sequences  of  reference  points  together  to  form  hypothesized  words. 
Time-normalized  spectral  patterns,  formatted  for  the  hypothesized  words,  are  then  compared  to 
reference  patterns  (for  either  speaker-independent  or  speaker-dependent  recognition  of  either 
continuous  or  discrete  speech).  If  the  comparison  is  a good  enough  match,  the  time  of  the  first 
and  last  reference  points  and  a total  normalized  error  (distance  between  the  input  and  reference) 
for  the  word,  along  with  the  label  for  the  word,  are  stored  in  a table  of  hypothesized  words. 

After  the  utterance  has  been  completed,  the  table  is  sorted  by  time  of  occurrence  of  the 
final  reference  point  and  is  then  used  by  the  tree-searching  algorithm  described  in  the  previous 
subsection  to  find  the  best  sequence  of  words.  A summary  flow  chart  for  the  word  recognition 
programs  investigated  during  this  contract  is  shown  in  Figure  10. 

Three  specific  computer  programs  were  generated  during  this  study: 

DIGREC, for speaker-independent recognition of connected digits using only the
TI 980 minicomputer

DIGRCT, for speaker-independent recognition of connected digits using the AP-120B
array processor for filtering and preprocessing

RTFNR, for speaker-dependent word recognition with automatic enrollment.

Note that for DIGREC, sampling is stopped during the sequence finding that is done after the
complete utterance has been input.

The primary difference between RTFNR and both DIGREC and DIGRCT is the source
for the reference scanning and recognition patterns. The reference patterns for DIGREC and
DIGRCT were derived from a clustering procedure applied to a design data set collected,
digitized, and hand-edited off-line before use by the test subjects. The reference patterns for
RTFNR, however, were derived from on-line enrollment of all the vocabulary words by each
subject using the system. These were speaker-specific reference patterns. Both procedures are
described in more detail in Sections III and IV.

Figure 10. Word Recognition Algorithm Flow Chart (Sheet 2 of 2)


SECTION  III 

SPEAKER-DEPENDENT  WORD  RECOGNITION 


A.  INTRODUCTION 

The speech-processing algorithm described in Section II provides the framework for both
speaker-dependent and speaker-independent word-recognition tasks. The algorithm is made
specific to the given task through definition of the reference scanning and recognition patterns.
The speaker-dependent task is accomplished through definition of a single set of reference
scanning patterns for each vocabulary word for each speaker. In contrast, multiple scanning
patterns are defined for each word in the speaker-independent word-recognition task (discussed
in Section IV).

The  reference  patterns  for  the  speaker-dependent  word-recognition  task  are  obtained  in  a 
single  enrollment  session  where  each  word  in  the  vocabulary  is  spoken  in  isolation.  Intersession 
variations  and  contextual  variations  in  continuous  speech  are  accounted  for  by  a method  of 
supervised  updating. 

The  remainder  of  this  section  discusses  the  definition  of  the  reference  scanning  and 
recognition  patterns,  the  method  of  supervised  updating,  and  the  application  of  the  algorithm  to 
continuous  speech. 

B.  ENROLLMENT 

Enrollment  in  the  speaker-dependent  word-recognition  task  defines  the  speaker-specific 
reference  patterns  for  each  word  in  the  vocabulary.  A total  of  20  words  per  speaker  is  allowed. 
Each  word  is  identified  to  the  system  and  then  spoken  four  times  in  isolation.  These  four 
repetitions  are  used  to  define  reference  scanning  and  recognition  patterns  for  the  word. 

The  enrollment  strategy  consists  of  preprocessing  the  data  for  each  word,  locating 
reference  points,  defining  scanning  patterns,  and,  finally,  defining  a recognition  pattern.  The 
preprocessing  step  uses  the  algorithm  defined  in  Section  II  to  provide  the  spectrum,  energy, 
regression  coefficients,  and  T-function. 

The  T-function  is  a measure  of  the  change  in  the  spectrum,  regression  coefficients  and 
energy  and  for  time  tj  is  given  by: 

2 

Tj  = ^ I ||(Aj+k  )N  - (Aj+k_  3 )N  ||2 * *  + l|C'j+k  - f’j+k-  3 II5 

k = 1 1 

+ ~ HEj+k  -Ej+k-3ll2  +4  IIEj+k-2  ~ Ej+k  4H2] 

where 

(Aj)N = normalized amplitude vector (Appendix A)

Cj = regression coefficient vector = (cj1, cj2)T

Ej = normalized scanning pattern energy.

Reference points are located in each of the four enrollment words in turn for use in
defining scanning patterns and recognition patterns. The steps in locating the reference points are
as follows:

(1) Locate the beginning, iST, and the end, iEND, of the word using an energy
threshold.

(2) Sum the energy in the word segment from iST to iEND.

(3) Locate the time points associated with 5, 10, 90, and 95 percent of this energy sum,
that is, (i5, i10, i90, i95).

(4) Locate all the T-function peaks in the word segment.

(5) If a T-function peak exists in the interval [iST, i10], define its location as the first
reference point, RP1; if not, let RP1 = i5.

(6) If a T-function peak exists in the interval [i90, iEND], define it as the last reference
point, RPN; if not, let RPN = i95.

(7) Generate the set, T, of all T-function peaks in the interval (RP1, RPN). If T is null,
then the word has only the two reference points, RP1 and RPN.

(8) If T is not null, use the elements of T in all combinations to maximize the function:


F = 


where i1 = RP1; iN = RPN; ik ∈ T for k = 2, ..., N - 1; Tk is the value of the
T-function at ik; Tmin is a normalization factor; and r is the power for the distance
weighting. The subset of T that maximizes the function F is then used as the set of
reference points for the word, with the first reference point being RP1 and the last
being RPN. The objective of the maximization is to distribute the reference points
uniformly throughout the word.
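Steps (1) through (7) can be sketched as follows. This is a hedged illustration only: the function name `end_reference_points` is hypothetical, "above threshold" is assumed to mean strictly greater than the threshold, each percent point is assumed to be the first frame at which the running energy sum reaches that fraction, and when several T-function peaks fall in an end interval the earliest (for RP1) or latest (for RPN) is assumed. Step (8) is omitted because it depends on the function F.

```python
def end_reference_points(energy, t_peaks, threshold):
    """Return (RP1, RPN, interior_peaks) for one enrollment word.

    energy   : per-frame energy values
    t_peaks  : frame indices of T-function peaks
    threshold: energy threshold used to find the word endpoints
    """
    above = [i for i, e in enumerate(energy) if e > threshold]
    if not above:
        raise ValueError("no frame exceeds the energy threshold")
    i_st, i_end = above[0], above[-1]               # step (1)

    total = sum(energy[i_st:i_end + 1])             # step (2)

    def percent_point(p):                           # step (3)
        acc = 0.0
        for i in range(i_st, i_end + 1):
            acc += energy[i]
            if acc >= p * total / 100.0:
                return i
        return i_end

    i5, i10, i90, i95 = (percent_point(p) for p in (5, 10, 90, 95))

    peaks = sorted(t_peaks)                         # step (4), peaks supplied by caller
    head = [t for t in peaks if i_st <= t <= i10]
    tail = [t for t in peaks if i90 <= t <= i_end]
    rp_first = head[0] if head else i5              # step (5)
    rp_last = tail[-1] if tail else i95             # step (6)

    interior = [t for t in peaks if rp_first < t < rp_last]   # step (7)
    return rp_first, rp_last, interior
```

If `interior` is empty, the word has only the two reference points; otherwise its elements are the candidates for the maximization in step (8).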

At  the  location  of  each  of  the  reference  points  thus  defined,  a scanning  pattern  is  defined  as 
discussed  in  Section  II.  The  scanning  pattern  uses  the  spectrum,  energy  and  regression 
coefficients,  and  their  respective  differences  between  time  samples. 

The  definition  of  the  recognition  patterns  also  makes  use  of  the  location  of  the  reference 
points.  The  steps  in  defining  the  recognition  pattern  format  are  as  follows: 

(1) If the energy is greater than a threshold for time samples iRP1 - 4, iRP1 - 3, iRP1 - 2,
and iRP1 - 1, then extrapolation columns are defined at iRP1 - 4 and iRP1 - 2.

(2) If the energy is greater than a threshold for time samples iRPN + 1, iRPN + 2, iRPN + 3,
and iRPN + 4, then extrapolation columns are defined at iRPN + 2 and iRPN + 4.

(3) Interior to every pair of reference points, ik and ik+1, interpolate with M columns
for the recognition pattern, where

The format just described is used to define a recognition pattern using the procedure outlined in
Section  II. 

Once the reference points, scanning patterns, and recognition pattern have been obtained
for one of the four repetitions of a word as described above, those scanning patterns are used to
scan the remaining three repetitions to automatically find reference points and define reference
patterns. At each repetition, the new scanning and recognition patterns are averaged with the old
patterns. Each of the four repetitions of the word is used in turn to define a set of reference
scanning and recognition points. As each reference pattern is formed, a composite error
consisting of the scanning error and the recognition error is computed. The minimum composite
error over the four different enrollment trials defines the ultimate enrollment for the word.

An example of automatic enrollment is shown in Figure 11 for the word “Two”. For an
energy threshold of 100, the beginning of the word, iST, is at 26 and the end, iEND, is at 66.
The 5-, 10-, 90-, and 95-percent energy sum points are at 29, 33, 55, and 58, respectively.
Therefore, the strategy outlined above locates the reference points at 26, 33, and 58. The
recognition format consists of three interpolation points between the reference points at 26 and
33, eight interpolation points between the reference points at 33 and 58, and the extrapolation
beyond the reference point at 58 by two and four points.

C.  UPDATING 

To accommodate intersession variations and continuous-speech context variation, several
sessions of supervised updating should be performed. The updating should consist of five sessions
separated by at least a day. A series of phrases that contain all the transitions for the 20-word
vocabulary should be spoken continuously. If the phrase is recognized, the reference patterns are
updated by adding 1/16 of the new pattern to 15/16 of the old patterns. The five sessions
spaced at 1-day intervals adapt the reference patterns for intersession variations. The series of
phrases with different contexts allows the reference patterns to adapt to continuous speech by
allowing the patterns to “see” something besides silence between the words and also to account
for coarticulation, which occurs in some contexts.
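The update rule stated above is a fixed-weight running average. A minimal sketch, assuming each pattern is held as a flat list of numbers (the function name `update_pattern` is illustrative, not from the report):

```python
def update_pattern(old, new, weight=1 / 16):
    """Blend a newly observed pattern into the stored reference pattern.

    With weight = 1/16 this adds 1/16 of the new pattern to 15/16 of the old one.
    """
    return [(1 - weight) * o + weight * n for o, n in zip(old, new)]
```

Under this rule the contribution of the original enrollment pattern shrinks by a factor of 15/16 with every accepted update, so the reference drifts gradually toward recent speech rather than jumping to any single utterance.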

D.  APPLICATION  TO  CONTINUOUS  SPEECH 

The method of word recognition using spectral pattern matching offers a dramatic
improvement in performance compared with schemes that rely on finding word boundaries with
energy profiles. The spectral pattern-matching method works well in continuous speech provided
the words are enrolled properly and several sessions of continuous speech updating are
accomplished, as discussed in Subsection III.C. The example in Figure 11 of a good enrollment
for the word “two” shows the reference points. In the example, the registration points were
interior to the word, so that the scanning patterns will not be confused with the scanning
patterns of adjacent words. With several sessions of continuous updating using phrases containing
all the word transitions, the scanning and recognition patterns adapt to “see” all these
transitions.

Unfortunately, the method of automatic enrollment described above does not always give
a good enrollment. As an example of a poor enrollment, Figure 12 shows the automatic
enrollment for the word “six”. For an energy threshold of 100, the beginning of the word, iST,
is at 25 and the end, iEND, is at 44. The 5-, 10-, 90-, and 95-percent energy sums are shown at
30, 31, 39, and 41, respectively. The automatic enrollment scheme chose the reference points at
29 and 44. The recognition pattern consists of five columns between the two reference points.
There should have been another reference point at 56 and extrapolation of the recognition
pattern both before the first reference point at 29 and after the last reference point at 56. As
the patterns exist with the automatic enrollment, the updating will not improve the recognition
of the word “six”.

It is believed that an improved automatic enrollment algorithm would consist of a set of
speaker-independent reference phoneme patterns. Given the phonetic spelling, the specific
phoneme  patterns  for  the  word  being  enrolled  would  be  scanned  across  the  input  data  for  that 
word  and  scanning  errors  obtained.  The  minimum  scanning  errors  would  be  located  and  used  in  a 
dynamic  programming  algorithm  to  obtain  the  best  sequence  of  phonemes  in  the  proper  spelling 
order  for  the  word.  T-function  peaks  between  the  minimum  error  locations  for  the  phoneme 
pairs  would  be  used  to  define  the  reference  points  for  the  word.  The  recognition  pattern  format 
would  be  specific  to  the  phonemic  spelling  of  the  word.  Once  the  reference  points  and 
recognition  pattern  format  are  defined,  the  enrollment  procedure  would  be  the  same  as  defined 
above. 

Figure 11. Automatic Enrollment for “Two”

Figure  12.  Automatic  Enrollment  for  “Six” 


SECTION  IV 


REFERENCE-PATTERN  GENERATION  FOR 
SPEAKER-INDEPENDENT  WORD  RECOGNITION 

The creation of reference patterns for speaker-dependent, isolated word recognition is
fairly straightforward: extract patterns from a single enrollment session and accommodate
intersession variation through updating of the reference patterns (learning with a teacher). For
speaker-independent word recognition, however, the increased variance of input data from
single-reference templates because of dialect, idiolect, and actual physical characteristics of the
speaker (length and shape of the vocal tract, pitch, etc.) requires a more complex approach. Of
course, allowing continuous speech input exacerbates the problem with the introduction of
contextual variations. An obvious solution is to allow multiple reference templates for each
word. One approach to deriving a set of multiple reference templates from a design data set is to
partition the data set on the basis of information other than the actual data, such as sex or
linguistic background. A second approach, the one used in the total voice verification study and
in the present study, is to partition the data on the basis of the data points themselves using
clustering techniques. The remainder of this section reviews the use of clustering in
speaker-independent reference template generation, discusses the clustering used in the studies at
Texas Instruments, and gives results of some further analysis of the patterns developed during the
total voice work, patterns that were also used on the current unconstrained digit recognition work
in order to preserve compatibility.

A.  REVIEW  OF  CLUSTERING  IN  SPEAKER-INDEPENDENT  WORD  RECOGNITION 

Except for the work done in this study and in the total voice study,1 the only other
applications of clustering to speaker-independent reference-pattern generation have been
concurrent work started independently about the same time at Bell Laboratories (as reported by
Rabiner, Levinson, Rosenberg, and Wilpon19-22) and subsequent independent work done in
Japan by Tanaka.23,24 The application at Bell Laboratories is isolated word recognition, and
Tanaka’s procedure has been applied only to the recognition of stop consonants. The remainder
of this subsection reviews these other works.


19 L.R. Rabiner, “On Creating Reference Templates for Speaker-Independent Recognition of Isolated Words,”
IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-26:34-42, February 1978.

20 S.E. Levinson et al., “Interactive Clustering Techniques for Selecting Speaker-Independent Reference
Templates for Isolated Word Recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-27:134-141, April 1979.

21 L.R. Rabiner et al., “Speaker-Independent Recognition of Isolated Words Using Clustering Techniques,”
Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Washington, D.C.,
574-577, 2-4 April 1979.

22 L.R. Rabiner and J.G. Wilpon, “Considerations in Applying Clustering Techniques to Speaker-Independent
Word Recognition,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing,
Washington, D.C., 578-581, 2-4 April 1979.

23 K. Tanaka, “A Standard Category Pattern-Making Method With Application to Phoneme Recognition,”
Proceedings of the Fourth International Joint Conference on Pattern Recognition, Kyoto, Japan, 1030-1032,
7-10 November 1978.

24 K. Tanaka, “A Talker Clustering Method for Standard Pattern Making,” Progress Report on Speech
Research ‘77, Electrotechnical Laboratory, Japan, August 1978.


Although the Bell Laboratories work has been for words said in isolation, the word sets
investigated, while including the digits (0 through 9), were considerably larger. One set was a
54-word vocabulary proposed originally by Gold,25 and the other set contained the alphabet, the
digits, and three control words. The speech representation chosen was a set of linear predictive
coding (LPC) parameters for each 15-ms frame of speech. These parameters then underwent a
time normalization using a dynamic programming technique.26 The similarity measure used in
the Bell Laboratories work was one of the following form proposed by Itakura:7


$$d[k, w(k)] = \log\left[\frac{a_{w(k)}\, V_k\, a_{w(k)}^{T}}{a_k\, V_k\, a_k^{T}}\right] \tag{5}$$

where $a_k$ is the vector of LPC coefficients associated with the kth frame of the test or unknown
utterance $x_i$; $a_{w(k)}$ is the vector of LPC coefficients derived from the w(k)th frame of the
reference utterance $x_j$; and $V_k$ is the matrix of autocorrelation coefficients computed from the
kth frame of the test utterance. Note that this distance measure is not a true metric since it is
not symmetrical.
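For illustration, the frame-level distance of Equation (5) can be sketched in a few lines. This is a hedged example, not the Bell Laboratories code: the names `quad_form` and `itakura_distance` are hypothetical, vectors are plain lists, and the autocorrelation matrix is a list of rows.

```python
import math


def quad_form(a, V):
    """Return a V a^T for a coefficient vector a and a square matrix V."""
    n = len(a)
    return sum(a[i] * V[i][j] * a[j] for i in range(n) for j in range(n))


def itakura_distance(a_ref, a_test, V_test):
    """Log ratio of prediction residuals, both measured on the test frame.

    a_ref  : LPC coefficient vector of the reference frame
    a_test : LPC coefficient vector of the test frame
    V_test : autocorrelation matrix computed from the test frame
    """
    return math.log(quad_form(a_ref, V_test) / quad_form(a_test, V_test))
```

Because the autocorrelation matrix always comes from the test frame, exchanging the roles of test and reference changes the matrix as well as the vectors, which is why the measure is not symmetrical.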


The clustering technique reported in Levinson et al.20 and Rabiner et al.21 is a
supervised, interactive procedure and is the combination (Figure 13) of the following four
procedures: chainmap, shared nearest neighbor, k-means, and a version of ISODATA. The details
of their procedures are given in Levinson et al.20 In this approach, the investigators first
attempted to find good estimates of both the number of clusters (using the chainmap) and their
cluster centers (using the k-means) for input to an iterative optimization procedure (ISODATA)
that allowed splitting and merging of clusters. The overall intent was to maximize a quality
measure σ for the assignment of N observations into M classes. The value of σ is given by

$$\sigma = \frac{\dfrac{1}{M(M-1)} \displaystyle\sum_{i=1}^{M} \sum_{j=1}^{M} \delta\!\left(x_p^{(i)}, x_p^{(j)}\right)}{\dfrac{1}{M} \displaystyle\sum_{i=1}^{M} \dfrac{1}{m_i(m_i-1)} \sum_{j=1}^{m_i} \sum_{k=1}^{m_i} \delta\!\left(x_j^{(i)}, x_k^{(i)}\right)} \tag{6}$$

where superscripts indicate class membership, p subscripts indicate reference-class prototypes,
$m_i$ is the number of observations in class i, and δ(a,b) is a nonsymmetric similarity measure
between patterns a and b that is the average of the Itakura distances over all the frames of the
reference pattern. Further comments regarding σ are made later in this section.
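As a rough illustration, a quality measure of this kind can be computed as below. The sketch assumes the measure is the ratio of the mean dissimilarity between class prototypes to the mean dissimilarity within classes; the function name `quality_sigma` and the argument layout are hypothetical, and any dissimilarity function may be passed in place of the averaged Itakura distance.

```python
def quality_sigma(clusters, prototypes, delta):
    """Ratio of mean between-prototype to mean within-class dissimilarity.

    clusters  : list of M lists, each holding the patterns of one class
    prototypes: list of M prototype patterns, one per class
    delta     : dissimilarity function delta(a, b)
    Requires M >= 2 and at least two patterns in every class.
    """
    M = len(clusters)
    inter = sum(
        delta(prototypes[i], prototypes[j]) for i in range(M) for j in range(M)
    ) / (M * (M - 1))

    intra = 0.0
    for members in clusters:
        m = len(members)
        intra += sum(delta(a, b) for a in members for b in members) / (m * (m - 1))
    intra /= M

    return inter / intra
```

Larger values indicate prototypes that are far apart relative to the spread inside each class, which is the sense in which the measure is maximized.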

25 B. Gold, “Word-Recognition Computer Program,” Massachusetts Institute of Technology, Cambridge, RLE
Technical Report 452, June 1966.

26 L.R. Rabiner et al., “Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition,”
IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-26:575-582, December 1978.

Figure 13. Bell Laboratories Clustering Procedures (From Rabiner et al.21)

Speaker-independent  recognition  results  for  isolated  digits  given  in  Rabiner  et  al.21  range 
from  97.5  to  100  percent.  Results  for  the  entire  39-word  set  range  from  about  50  to  80  percent, 
and  recognition  improves  with  the  number  of  reference  templates  used  for  each  word. 

Rabiner and Wilpon22 extended the previous work to unsupervised clustering using the
same data set, distance, and quality measure (σ) as previously used. One clustering algorithm uses
only  precomputed  distances  between  observations,  attempting  to  place  each  observation  uniquely 
in  a cluster  with  all  others  that  are  similar,  and  a second  clustering  algorithm  combines  (by 
averaging)  observations  that  are  similar.  Comparisons  were  made  in  this  work  among  three 
different  LPC  feature  sets  and  between  cluster  representation  either  by  the  data  point  with  the 
minimum  maximum  distance  from  all  other  points  in  the  cluster  or  by  the  average  of  all  the 
points  in  the  cluster.  The  results  of  Rabiner  and  Wilpon  indicate  that  the  algorithm  using 
precomputed  distances  was  superior  to  the  other  and  that  the  use  of  an  averaged  pattern  to 
represent  the  cluster  was  superior  to  using  the  minimum  maximum  center.  Again,  the  recognition 
accuracy  improved  with  the  number  of  reference  templates  used. 

Tanaka23,24 clusters a set of observations into different classes by moving each
observation iteratively by some amount proportional to the density of points in the
neighborhood of the observation, where each point is modified by its gradient with respect to all
other points. Specifically, for the set of observation vectors $x_i$ (i = 1, ..., N) for the jth
iteration,

$$x_i^{j+1} = \frac{1}{w_i^{j+1}} \sum_{k=1}^{N} \delta(s^j) \left[ x_k^j + c \sum_{r=1}^{N} \left(x_k^j - x_r^j\right) \delta(2s^j) \right] \tag{7}$$

where

$$\delta(s^j) = \exp\left\{ -\left[d(x_i^j, x_k^j)\right]^2 / 2(s^j)^2 \right\}$$

$$\delta(2s^j) = \exp\left\{ -\left[d(x_k^j, x_r^j)\right]^2 / 2(2s^j)^2 \right\}$$

$$w_i^{j+1} = \sum_{k=1}^{N} \delta(s^j)$$

and

$$s^j = s^{j-1} / \sqrt{2}$$

Tanaka makes the analogy to a potential function of an exponential form, so that the
$(x_k^j - x_r^j)\,\delta(2s^j)$ term can be considered a gradient of the potential function. Hence, the term in
brackets in Equation (7) represents one of the points modified by the gradient of a potential
function of that point with respect to all other points. These modified points are then used in N
weighted sums to determine each of the new N points for the (j + 1)st iteration. Clustering stops
when the window $\delta(s^j)$ has narrowed sufficiently that every data point is the same as that of the
previous stage. The number of iterations and the final number of clusters are obviously affected
by the choice of c and $s^0$.

This approach is quite similar to that presented by Fukunaga and Hostetler.27 However,
they essentially use the points as modified in the brackets in Equation (7) as the new N points
for the (j + 1)st iteration, instead of using weighted sums of such points.

Tanaka  applied  this  method  to  generation  of  reference  patterns  for  use  in  the  difficult 
problem  of  detecting  the  three  stop  consonants  /p,t,k/.  This  effort  is  directed  toward  a phonetic 
classification-based  speech  recognition  system.  Tanaka’s  results  are  89  percent  for  the  test  data 
and  82  percent  for  stop  consonants  of  other  speakers. 

The  method  used  in  the  study  covered  by  this  final  report  and  the  method  used  in  the 
preceding  total  voice  study  differ  significantly  from  the  approaches  just  described.  Tanaka’s 
approach  differs  not  only  in  terms  of  using  phonemic-based  recognition  but  also  in  the  clustering 
by  allowing  movement  of  the  data  points.  Although  differing  final  clusters  and  numbers  of 
clusters  could  be  produced  in  Tanaka’s  algorithm  by  varying  the  parameters,  he  does  not  discuss 
how  to  choose  the  final  clusters.  Differing  applications  do  not  allow  comparison  of  his  final 
results  to  those  presented  here. 


27 K. Fukunaga and L.D. Hostetler, “The Estimation of the Gradient of a Density Function, With Applications in Pattern Recognition,” IEEE Transactions on Information Theory, IT-21:32-40, January 1975.
Although the application and the clustering approach used at Bell Laboratories differ less, their approach is to find a single “best” set of clusters, ultimately using an ISODATA algorithm that can split and merge clusters. The approach used at Texas Instruments, however, finds good estimates for cluster centers using a hierarchical clustering approach, performing an iterative optimization on cluster definitions for several fixed values of M (later referred to as “c”), and choosing M based not only on criterion values for each of the final partitions, but also on a subjective evaluation of the final cluster averaged patterns themselves.

A criterion similar to that used by Bell Laboratories (σ) is compared with the one used here (trace of the within-class scatter matrix) later in this section. In addition, although a common data base could not be used, a small test of recognition performance on isolated digits was performed to provide a rough comparison with the isolated digit results presented by Rabiner et al.21

B.  DETAILED  CLUSTERING  ALGORITHM 


The clustering algorithm used in the total voice study and extended in the present study represents a unique combination of several methods, all centered on the use of Euclidean distances because the fast vector comparator exists peripheral to the TI 980 to perform the computation. The entire procedure is shown in Figure 14. The patterns used in the speaker-independent digit-recognition evaluations were generated during the total voice study using the path through the procedure denoted by the double lines in Figure 14. The other paths in Figure 14 were added during this study for evaluation of and consistency checks on the previous patterns and for rudimentary outlier analysis.


A detailed description of the procedure used to generate the patterns and the patterns themselves are given in the total voice final report.1 A brief description of the procedure is given here for completeness. The method used in the total voice study to derive the patterns used in the evaluation was an agglomerative method combining the two clusters that have the smallest average distance (MINAVE) between the points in the two clusters, i.e., combining the i and j clusters that have the minimum

$$\frac{1}{n_i n_j} \sum_{x \in X_i} \sum_{x' \in X_j} d(x, x') \tag{8}$$

where

$n_i$ = number of $x$ in class $X_i$
$n_j$ = number of $x'$ in class $X_j$

and, in this case,

$$d(x, x') = \|x - x'\|^2$$
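The merging rule of Equation (8) can be illustrated with a short sketch. This is a naive illustration and not the implementation used in the study: the function names `sq_dist` and `agglomerate` and the `linkage` argument are hypothetical, and no attempt is made at efficiency. The `"minmax"` option merges the pair of clusters with the smallest maximum inter-point distance instead of the smallest average.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d(x, x') = ||x - x'||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def agglomerate(points, n_clusters, linkage="minave"):
    """Merge clusters until n_clusters remain; returns lists of point indices.

    linkage="minave": merge the pair with the smallest average distance
                      between points of the two clusters (Equation 8).
    linkage="minmax": merge the pair with the smallest maximum distance.
    """
    clusters = [[i] for i in range(len(points))]

    def score(ca, cb):
        dists = [sq_dist(points[i], points[j]) for i in ca for j in cb]
        if linkage == "minave":
            return sum(dists) / len(dists)
        return max(dists)

    while len(clusters) > n_clusters:
        # Evaluate every pair of current clusters and pick the best one.
        pairs = [(score(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Recording the winning score at each merge would give the joining-criterion values that a tree printout of the hierarchy needs.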

Figure 14. Block Diagram of Clustering Procedure Developed for Clustering Scanning and Recognition Patterns for Speaker-Independent Digit Recognition

The second step used was to improve on the partitions from the hierarchical clustering iteratively by moving samples from one group to another if such a move improved the value of some criterion function. This step used the iterative optimization method of Duda and Hart28 that minimized the sum-of-squared error criterion $J_e$, written as

$$J_e = \sum_{i=1}^{c} J_i$$

where

$$J_i = \sum_{x \in X_i} \|x - m_i\|^2$$

and $m_i$ is the mean of class $X_i$. If a point $\hat{x}$ is moved from class $X_i$ to class $X_j$, the means $m_i$ and $m_j$ change to

$$m_i^* = m_i - \frac{\hat{x} - m_i}{n_i - 1} \quad \text{and} \quad m_j^* = m_j + \frac{\hat{x} - m_j}{n_j + 1}.$$

The value of $J_i$ decreases to

$$J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2$$

and $J_j$ increases to

$$J_j^* = J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2$$

Clearly, then, since the criterion is to minimize $J_e$, if

$$\|\hat{x} - m_j\|^2 < \frac{n_i (n_j + 1)}{n_j (n_i - 1)} \|\hat{x} - m_i\|^2 \tag{13}$$

then $\hat{x}$ should be transferred from class $X_i$ to class $X_j$. Specifically, the point $\hat{x}$ is moved to the class $X_j$ having the smallest $(n_j/(n_j + 1))\,\|\hat{x} - m_j\|^2$.
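The transfer rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the study: the names `iterative_optimize`, `sum_squared_error`, and `max_passes` are hypothetical, class means are recomputed from scratch rather than updated incrementally, and a class with a single member is never emptied.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def sum_squared_error(points, labels, n_classes):
    """J_e: sum over classes of squared distances to the class mean."""
    total = 0.0
    for c in range(n_classes):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            m = mean(members)
            total += sum(sq_dist(p, m) for p in members)
    return total


def iterative_optimize(points, labels, n_classes, max_passes=100):
    """Move single points between classes while doing so lowers J_e."""
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for idx, x in enumerate(points):
            i = labels[idx]
            members = {c: [p for p, l in zip(points, labels) if l == c]
                       for c in range(n_classes)}
            n_i = len(members[i])
            if n_i <= 1:
                continue  # assumption: do not empty a class
            # Decrease in J_i if x leaves its current class.
            best_j = i
            best_cost = n_i / (n_i - 1) * sq_dist(x, mean(members[i]))
            for j in range(n_classes):
                if j == i:
                    continue
                n_j = len(members[j])
                # Increase in J_j if x joins class j.
                cost = (n_j / (n_j + 1) * sq_dist(x, mean(members[j]))
                        if n_j else 0.0)
                if cost < best_cost:
                    best_j, best_cost = j, cost
            if best_j != i:
                labels[idx] = best_j
                moved = True
        if not moved:
            break
    return labels
```

Starting the loop from the labels of a hierarchical partition corresponds to the two-stage procedure described in the text; each accepted move lowers $J_e$, so the loop terminates.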

An additional property (not necessarily good) of the selection of $J_e$ as a criterion is that a set of equally divided clusters is favored over a set containing both small and large clusters, as noted earlier. This can be seen by considering $n_i \gg n_j$ in Equation (13), which yields approximately

$$\|\hat{x} - m_j\|^2 < \frac{n_j + 1}{n_j} \|\hat{x} - m_i\|^2$$

Thus, for $n_j = 1$, the distance $\|\hat{x} - m_j\|^2$ need only be less than twice the distance $\|\hat{x} - m_i\|^2$ to the old mean for $\hat{x}$ to be transferred to class $X_j$.

28 R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons (New York, 1973).

C.  CRITERIA  FOR  MEASURING  PARTITION  GOODNESS 

Although minimization of $J_e$ was the criterion used in the iterative optimization and $(J_e^c - J_e^{c+1})/J_e^c$ was used as a second criterion to aid in selecting the number of clusters, the values of several other criteria related to $J_e$ were calculated during the current study for all patterns for numbers of classes from 1 to FN (= 10). (Superscripts on $J_e$ are used to denote number of classes.)

The discussion in the remainder of this subsection assumes a knowledge of scatter matrices. Appendix B has been provided for those not familiar with the concept.

A third criterion is the value of tr $S_B$/tr $S_W$, which is inherently maximized during the iterative optimization by the minimization of $J_e$ (= tr $S_W$). Note that since tr $S_T$ = $J_e^1$ and tr $S_B$ = tr $S_T$ − tr $S_W$, then tr $S_B$/tr $S_W$ = $(J_e^1 - J_e^c)/J_e^c$ for c classes.

A fourth related criterion suggested by Hartigan29 for choosing the number of clusters is $(n - c)(J_e^c - J_e^{c+1})/J_e^c$. Hartigan suggests that values of this ratio greater than 10 justify increasing the number of clusters.

A fifth criterion is related to the F-ratio from analysis of variance, taking into account the degrees of freedom of tr $S_B$ and tr $S_W$. This criterion is given by

$$\frac{\mathrm{tr}\,S_B/(c - 1)}{\mathrm{tr}\,S_W/(n - c)} = \frac{(n - c)(J_e^1 - J_e^c)}{(c - 1)\,J_e^c}$$

and is attributed by Everitt30 to Calinski and Harabasz.31
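Because criteria 2 through 5 depend only on the $J_e$ values and the sample count, they can be tabulated directly. The sketch below is illustrative only: the function name `partition_criteria` is hypothetical, the fourth criterion is taken in the form $(n - c)(J_e^c - J_e^{c+1})/J_e^c$ that matches the column heading "(N-C)*DELJE/JE(C)" in the printouts, and the sample count n must be supplied by the caller.

```python
def partition_criteria(je, n):
    """Tabulate criteria 1-5 from a list of J_e values.

    je[c - 1] is J_e for c classes (c = 1 .. len(je)); n is the number of
    samples. Returns tuples (c, J_e, crit2, crit3, crit4, crit5), with None
    where a criterion needs J_e for c + 1 classes and it is not available.
    """
    rows = []
    for c in range(1, len(je) + 1):
        j_c = je[c - 1]
        j_next = je[c] if c < len(je) else None
        # Criterion 2: relative drop in J_e when one more class is allowed.
        crit2 = (j_c - j_next) / j_c if j_next is not None else None
        # Criterion 3: tr S_B / tr S_W = (J_e^1 - J_e^c) / J_e^c.
        crit3 = (je[0] - j_c) / j_c
        # Criterion 4: (n - c) times criterion 2.
        crit4 = (n - c) * crit2 if crit2 is not None else None
        # Criterion 5: F-ratio form; taken as 0 for a single class.
        crit5 = (n - c) * (je[0] - j_c) / ((c - 1) * j_c) if c > 1 else 0.0
        rows.append((c, j_c, crit2, crit3, crit4, crit5))
    return rows
```

For example, `partition_criteria([58412.0, 48136.5, 45686.7], 165)` uses the first three $J_e$ values of Table 5; the value n = 165 is an assumption made here because it gives results close to the tabulated 0.176, 28.850, 0.213, and 34.795, not a figure stated in the text.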

The sixth criterion calculated during this study is a σ analogous to that used in the Bell Laboratories studies.19-22 The value of σ is calculated by


-1  — yyt  || m 1 

c(c  I ) l,  it  i 1 

i=l  j--l  


29 J.A. Hartigan, Clustering Algorithms, John Wiley and Sons (New York, 1975).

30 B. Everitt, Cluster Analysis, Heinemann Educational Books, Ltd. (London, 1974).

31 T. Calinski and J. Harabasz, “A Dendrite Method for Cluster Analysis,” (unpublished), 1971.
The relationship between σ and the tr $S_B$/tr $S_W$ criterion can be seen better by putting both σ and criterion 5 in equivalent forms. The σ term can be rewritten as


a 


(17) 


and the fifth criterion multiplied by the factor c/(2n) can be rewritten (see Appendix C) as a seventh criterion α, as follows:

$$\alpha = \frac{c\,(\mathrm{tr}\,S_B)/(c - 1)}{2n\,(\mathrm{tr}\,S_W)/(n - c)} \tag{18}$$


The values of all seven of these criteria for several classes of one of the 34 pattern types clustered in this study are shown in Table 5. The desire is to maximize the last five of the seven criteria discussed above. Note, however, that the sixth criterion, σ, is actually only monitored, while the optimization is on the basis of $J_e$, from which all the other criteria (except σ) are derived.


TABLE 5. VALUES OF SEVEN CLUSTERING CRITERIA FOR
POSTITERATIVE OPTIMIZATION OF MINMAX AGGLOMERATIVE
CLUSTERS FOR RECOGNITION PATTERN FOR DIGIT “SIX”

                               Criterion
  c          1        2        3        4        5        6        7

  1   58,412.0    0.176    0.000   28.850    0.000    0.000    0.000
  2   48,136.5    0.051    0.213    8.296   34.795    0.422    0.208
  3   45,686.7    0.055    0.279    8.897   22.561    0.426    0.203
  4   43,177.7    0.022    0.353    3.473   18.935    0.518    0.227
  5   42,246.4    0.030    0.383    4.771   15.306    0.530    0.229
  6   40,986.8    0.019    0.425    2.943   13.520    0.486    0.243
  7   40,228.1    0.036    0.452    5.652   11.903    0.546    0.249
  8   38,789.0    0.003    0.506    0.543   11.346    0.601    0.272
  9   38,654.9    0.021    0.511    3.343    9.987    0.601    0.269
 10   37,826.5       —     0.544       —     9.372    0.633    0.281
D. DESCRIPTION OF CLUSTER ANALYSIS DOCUMENTATION


This subsection gives a brief description of the printouts available from an analysis program that was run on the output data from the entire clustering procedure shown in Figure 9 for each of the 34 pattern types. A subset of the available outputs is presented in Appendix D. Samples of available printouts are given in this subsection for the scanning pattern for reference point 1 for the digit zero. MINAVE is used to refer to agglomerative clustering by combining the two clusters having the minimum average distance between all points in the two clusters. Correspondingly, MINMAX refers to agglomerative clustering by combining the two clusters having the minimum maximum distance between the two clusters.

1.  Trees 

The first type of output available is a tree (dendrogram) showing the final joinings or agglomerations in the hierarchical procedures and is available for all four branches of the algorithm shown in Figure 9. Accompanying each dendrogram is a table showing the values of the joining criterion for each level and the relationship of the criterion values to the dendrogram. The tree-printing subroutine was adapted from an appendix of Anderberg.32 The tree for the MINAVE hierarchical clustering using all samples is shown in Figure 15 for the joining criteria values in Table 6.

2.  Parameter  Comparisons 

The  second  type  of  output  from  the  analysis  program  gives  the  values  of  the  six  criteria 
described  in  Subsection  IV. C,  the  values  of  the  errors  during  the  agglomerative  clustering,  and 
the  number  of  iterations  required  in  the  iterative  optimization  to  reach  the  final  partitions.  The 
conditions  for  each  of  the  parameter  comparisons  produced  and  a reference  to  the  figure 
showing  an  example  of  that  comparison  are  listed  below: 

MINAVE, all points; pre- to postiterative optimization comparison (Figure 16)
MINAVE, outliers discarded; pre- to postiterative comparison (Figure 17)

MINMAX, all points; pre- to postiterative optimization comparison (Figure 18)
MINMAX, outliers discarded; pre- to postiterative comparison (Figure 19)

Preiterative optimization, all points; MINAVE to MINMAX comparison (Figure 20)
Postiterative optimization, all points; MINAVE to MINMAX comparison (Figure 21).

3.  Consistency  Tests 

The class assignments obtained using the MINAVE and the MINMAX hierarchical clustering procedures after iterative optimization are compared for the number of clusters ranging from 2 to FN (= 10). This comparison is in terms of two contingency matrices such as shown in Figure 22 for 10 classes after iterative optimization. This output first lists the members of each class for the iteratively optimized results of the MINAVE agglomerative clustering followed by those from the MINMAX agglomerative clustering. The first contingency table then compares the


32 M.R. Anderberg, Cluster Analysis for Applications, Academic Press (New York, 1973).
TABLE 6. CRITERION VALUES FOR TREE FOR FINAL 24 STAGES
OF MINAVE AGGLOMERATIVE CLUSTERING OF ALL (166) SCANNING
PATTERNS FOR FIRST REFERENCE POINT OF DIGIT “ZERO”

            Class            Criterion
 Stage     I     J     Absolute   Relative

   1      16    22     403.692        1
   2       7    23     406.000        1
   3       2     6     408.710        1
   4       7    15     414.500        1
   5      10    19     414.750        1
   6       9    21     418.000        2
   7       1    11     419.967        2
   8       9    18     429.875        3
   9      13    20     439.000        3
  10       2     3     441.115        3
  11       4     8     461.250        5
  12      14    16     462.714        5
  13       1    12     464.742        5
  14       1     2     488.888        7
  15       1     5     509.195        9
  16       1    10     527.124       10
  17       4    24     530.000       10
  18       1    14     538.098       11
  19       4    25     542.947       11
  20       1    13     573.328       14
  21       1     9     626.332       18
  22       1     7     655.048       20
  23       4    17     664.150       20
  24       1     4     729.398       25

members of the classes from each partition, with each entry in the table showing the number of points that are members of both classes. The second contingency table compares data points in pairs for joint membership or lack of joint membership in the same class. In particular, if two data items are in different classes in a partition, this fact is denoted by a 1 in row or column 1. Otherwise, a 1 appears in row or column 2. Hence, the (2,2) entry in the contingency table indicates how many pairs of the N(N − 1)/2 pairs of points are in the same class for both partitions, and the (1,1) entry indicates how many of the pairs are in different classes for both partitions.

Ideally, for both contingency tables, all off-diagonal elements will be 0. Hence, a measure of the closeness of the two partitions in both cases is the sum of the diagonal entries divided by the sum of all the entries in the table [N for the first table and N(N − 1)/2 for the second table].
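Both contingency tables and the closeness measure can be sketched in a few lines. This is an illustration only; the function names are hypothetical, and for the first table the sketch assumes that class k of one partition is to be compared with class k of the other (the text does not state how class numbers of the two partitions are matched).

```python
from itertools import combinations


def class_contingency(labels_a, labels_b, n_classes):
    """First kind: entry [a][b] counts points in class a of partition A
    and class b of partition B."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table


def pair_contingency(labels_a, labels_b):
    """Second kind: 2 x 2 table over all N(N - 1)/2 pairs of points.

    Index 0 (row or column 1 in the text) means the pair is in different
    classes; index 1 (row or column 2) means the pair is in the same class.
    """
    table = [[0, 0], [0, 0]]
    for p, q in combinations(range(len(labels_a)), 2):
        row = 1 if labels_a[p] == labels_a[q] else 0
        col = 1 if labels_b[p] == labels_b[q] else 0
        table[row][col] += 1
    return table


def diagonal_ratio(table):
    """Sum of diagonal entries divided by the sum of all entries."""
    total = sum(sum(row) for row in table)
    return sum(table[k][k] for k in range(len(table))) / total
```

The pair table does not depend on how classes are numbered, so its ratio is unaffected by the matching assumption noted above.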


[Computer printout: STATISTICS FOR DIGIT: 0; REF PT: 1; NO OF DATA PTS: 166. MINAVE AGGLOM CLUSTERING; 23 FEB79. PRE AND POST ITERATIVE OPTIMIZATION FOR MIN JE. For each number of classes C from 1 to 25, the printout lists ERR, NO OF ITERS, and PRE and POST values of JE (=TR(W)), (JE(C)-JE(C+1))/JE(C), TR(B)/TR(W), C(N-C)TR(B)/2N(C-1)TR(W), BTL'S SIGMA, (N-C)*DELJE/JE(C), and (N-C)*TR(B)/(C-1)*TR(W). POST values are printed only for C = 1 through 10.]

Figure 16. Parameter Comparisons for Pre- and Postiterative
Optimization of MINAVE Partitions Using All Points
[Computer printout: STATISTICS FOR DIGIT: 0; REF PT: 1; NO OF DATA PTS: 158. MINAVE AGGLOM CLUSTERING; 23 FEB79. PRE AND POST ITERATIVE OPTIMIZATION FOR MIN JE. Columns: C, ERR, NO OF ITERS, and PRE and POST values of JE (=TR(W)), (JE(C)-JE(C+1))/JE(C), and TR(B)/TR(W).]

Figure 17. Parameter Comparisons for Pre- and Postiterative
Optimization of MINAVE Partitions With Outliers Discarded

Figure 18. Parameter Comparisons for Pre- and Postiterative
Optimization of MINMAX Partitions Using All Points

Figure 19. Parameter Comparisons for Pre- and Postiterative
Optimization of MINMAX Partitions With Outliers Discarded

Figure 20. Parameter Comparisons for Preiterative Optimization
of MINAVE and MINMAX Partitions Using All Points

[Computer printout: STATISTICS FOR DIGIT: 0; REF PT: 1; NO OF DATA PTS: 166. MINAVE AND MINMAX AGGLOM CLUSTERING; 23 FEB79. POST ITERATIVE OPTIMIZATION FOR MIN JE. For each number of classes C from 1 to 10, the printout lists AVE and MAX values of NO OF ITERS, JE (=TR(W)), (JE(C)-JE(C+1))/JE(C), TR(B)/TR(W), C(N-C)TR(B)/2N(C-1)TR(W), BTL'S SIGMA, (N-C)*DELJE/JE(C), and (N-C)*TR(B)/(C-1)*TR(W).]

Figure 21. Parameter Comparisons for Postiterative Optimization
of MINAVE and MINMAX Partitions Using All Points


After  all  class  assignments  and  contingency  tables  have  been  printed,  the  measure  just 
described  is  printed  in  a summary  listing  for  the  partition  comparisons  both  before  and  after 
optimization  (Figure  23). 

E.  OBSERVATIONS  ON  THE  CLUSTERING  RESULTS 

Since the conclusions to be presented in this subsection have not been proved analytically but have only been observed from the clustering results, they are presented as observations only. However, these observations are made on the basis of the results of clustering 34 different pattern types, which would imply some generality to the observations.

In investigating the properties of various criteria, it is first useful to have a measure of the distribution of the class size, i.e., whether most of the samples are in one class or whether they are evenly distributed in all classes. The measure used was a normalized version of the entropy given by

$$E = -\frac{1}{E_{MAX}} \sum_{i=1}^{c} \frac{n_i}{n} \log_2 \frac{n_i}{n} \tag{19}$$

and $E_{MAX} = \log_2 c$. This function is maximum for c equally divided clusters having $n_i = n/c$, and is minimum for (c − 1) clusters having $n_i = 1$ and one cluster having $n_i = n - c + 1$.
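A short sketch of Equation (19) follows. The function name `normalized_entropy` is hypothetical, and returning 0 for a single class (where $\log_2 c = 0$) is a convention chosen here, not one stated in the text.

```python
import math


def normalized_entropy(class_sizes):
    """Normalized entropy E of a partition given its class sizes.

    E is 1 when all classes are the same size and approaches its minimum
    when all but one class hold a single point.
    """
    n = sum(class_sizes)
    c = len(class_sizes)
    if c < 2:
        return 0.0  # assumed convention: log2(1) = 0 would divide by zero
    h = -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0)
    return h / math.log2(c)
```

For 20 points in 4 classes, sizes (5, 5, 5, 5) give E = 1, while sizes (17, 1, 1, 1) give a value well below 1.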

Observation 1: The first observation is experimental confirmation of the known fact in cluster analysis that the minimization of $J_e$, the sum-of-squared error, favors equal-sized clusters, an example of which is given in Figure B-1. One demonstration of why this is true is given in Subsection IV.B. To demonstrate in another way, remember that minimizing $J_e$ is equivalent to maximizing tr $S_B$, given by

$$\mathrm{tr}\,S_B = \frac{1}{2n} \sum_{i=1}^{c} \sum_{j=1}^{c} n_i n_j \|m_i - m_j\|^2 \tag{20}$$

Consider just one of the terms. Assuming that the means remain relatively constant as points are changed between class i and class j, it is easy to show that $n_i n_j$ is maximum for $n_i = n_j$.
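The step described as easy to show can be written out as follows. This sketch assumes, beyond what the text states, that the combined size $s = n_i + n_j$ of the two classes stays fixed while points are exchanged between them and that the sizes may be treated as continuous.

```latex
\begin{align*}
n_i n_j &= n_i (s - n_i), \qquad s = n_i + n_j \ \text{fixed} \\
\frac{d}{dn_i}\bigl[n_i (s - n_i)\bigr] &= s - 2 n_i = 0
  \quad\Longrightarrow\quad n_i = \frac{s}{2} = n_j \\
\frac{d^2}{dn_i^2}\bigl[n_i (s - n_i)\bigr] &= -2 < 0
  \quad\text{(a maximum)}
\end{align*}
```

With the means held fixed, each term of the double sum is therefore largest when the two classes involved are the same size.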

The results of the iterative optimization based on minimizing $J_e$ in fact showed that E (the normalized entropy) increased after the optimization as shown by the histograms of E for the MINAVE agglomerative clustering in Figure 24 for the 24 scanning patterns (c = 2 through 10) and in Figure 25 for the 10 recognition patterns (c = 2 through 10).

Observation 2: MINMAX agglomerative clustering favors equal-sized clusters more than MINAVE agglomerative clustering. No previous reference to this type of observation could be found in the clustering literature, although many references existed to the “chaining” that occurs in agglomerative clustering when combining the two clusters having the minimum minimum distance between them.

Figure 26 shows the results as histograms of E for both the MINAVE and the MINMAX agglomerative clustering (before iterative optimization) for the 24 scanning patterns (c = 2 through 10). The results shown in Figure 27 are the same for the recognition patterns.


Observation 1 indicated that lower $J_e$'s (and, hence, larger α's) result from more equal-sized clusters, so that two “corollaries” exist for observation 2. The first is that $J_e$ for MINMAX agglomerative clustering (preiterative optimization) is generally smaller than $J_e$ for MINAVE agglomerative clustering. The second corollary is that, since $J_e$ is usually lower for the MINMAX agglomeration, the number of iterations is also lower for the iterative optimization of the MINMAX agglomerative clusters. Histograms are not given, but refer to the columns for the number of iterations given in the 10 tables of postiterative optimization statistics for the 10 recognition patterns in Appendix D.

Observation 3: In spite of quite different starting partitions, the iterative optimization procedure applied to the MINAVE agglomerative clusters and to the MINMAX agglomerative clusters yielded similar partitions. A measure of similarity given by the ratio of the sum of the diagonal entries of a contingency table divided by the sum of all the entries in the table is shown in Table 7 for both scanning and recognition patterns before iterative optimization and in Table 8 for both pattern types after iterative optimization. These similarity measures are for the first type of contingency table described in Subsection IV.D.3.

Observation 4: Although α was larger after iterative optimization than before (as it should be since the optimization criterion is to minimize $J_e$, which is equivalent to maximizing α), σ almost always decreased for the optimization using the MINAVE agglomerative beginning partitions and decreased for about half the pattern types for optimization using the MINMAX agglomerative beginning partitions. Plots of $\sigma_{post}/\sigma_{pre}$ versus $\alpha_{post}/\alpha_{pre}$ are given in Figure 28 for the MINAVE clustering and in Figure 29 for the MINMAX clustering for scanning patterns.

Two important points should be made. The first point is that the σ decrease was much greater for the MINAVE case than for the MINMAX case because $\sigma_{pre}$ was much larger for the former, as can be seen by comparing Figure 30 and Figure 31, which show $\sigma_{pre}$ versus $\alpha_{pre}$ for scanning patterns in both cases. As a matter of interest, Figure 32 shows σ versus α for postoptimization of the MINAVE agglomerative clustering partitions, showing that as the clusters approach equal sizes, as happens for the optimization (observation 1), σ and α approach the same value. In fact, for equally divided cluster partitions, σ = α, as can be seen from the final two expressions in Subsection IV.C by setting $n_i = n_j = n/c$. The second point to remember is to temper the conclusions reached about the relationship between σ and α by the fact that optimization is with respect to α (actually $J_e$) while σ is being monitored only. Possibly different conclusions would be reached if iterative optimization were with respect to σ with α being monitored only.

F. TESTING CLUSTER VALIDITY WITH A PRIORI INFORMATION ABOUT DATA

The problem of testing cluster validity is a subject that has received very little attention in the literature, probably because of the difficulty of the problem. One of the few references is in Duda and Hart,28 who use a hypothesis testing approach to test validity on the basis of the size of the reduction in $J_e$. In the specific example given, they assume multivariate normal distributions and advance the hypothesis that the data are actually from one cluster. They then derive an expression for testing this hypothesis for $J_e$ to a specified significance level.

The approach taken in this section is an entirely different method for testing cluster validity. In the total voice speaker verification final report,1 descriptions of the characteristics of the reference patterns generated from the clustering algorithm are given in terms of a priori information known about the data points making up each class. In this study, a quantitative measure is used for the same type of comparison. Specifically, it is assumed that a male/female division of the data is a correct way to separate the data, and then the degree to which the actual clusters agree with this assumption is measured. Because of the differences in vocal tract resonances (formants) between males and females, this is a good assumption in most cases (probably a better assumption than assuming a unimodal distribution for the data). Reference to the total voice final report, however, reveals cases where the data actually cluster on the basis of other attributes, such as the scanning patterns for the third reference point of “two,” which, since the formants for /u/ for males and females are very close, split according to context. This would suggest extending the technique in this subsection to account quantitatively for multiple attributes.

Figure 24. Histograms of Normalized Entropy as Measure of Dispersion in
Class Sizes for MINAVE Agglomerative Clustering of Scanning Patterns

Figure 25. Histograms of Normalized Entropy as Measure of Dispersion in
Class Sizes for MINAVE Agglomerative Clustering of Recognition Patterns


The proposal is that the average information gained by knowing in which class a point falls should be reduced by the a priori knowledge of an attribute of that point, if the classes represent that attribute. Alternately stated, the proposal is that the average uncertainty about the class in which a point falls should be reduced by the amount of certainty gained about the class membership, knowing the attribute (the sex in this case). From the information theory literature (e.g., Reza33), the average uncertainty is the entropy, given by*

$$H(c) = -\sum_{i=1}^{c} p(i) \log p(i)$$


33 F.M. Reza, An Introduction to Information Theory, McGraw-Hill (New York, 1961).
*All logarithms are taken to base 2.
Figure 26. Histograms of Normalized Entropy as Measure of Dispersion in
Class Sizes for Preiterative Optimization of Scanning Patterns

Figure 27. Histograms of Normalized Entropy as Measure of Dispersion in
Class Sizes for Preiterative Optimization of Recognition Patterns
TABLE 7. CONTINGENCY TABLE OF THE FIRST KIND*
RESULTS FOR PREITERATIVE OPTIMIZATION

For Scanning Patterns

        Ref     Contingency Table Ratio for Given Number of Classes
Digit   Point     2      3      4      5      6      7      8      9      10

  0       1     0.440  0.548  0.542  0.536  0.470  0.470  0.392  0.416  0.434
  0       2     0.663  0.536  0.548  0.373  0.518  0.536  0.524  0.542  0.524
  0       3     1.000  0.928  0.934  0.825  0.536  0.500  0.488  0.506  0.416
  1       1     0.815  0.643  0.655  0.756  0.750  0.720  0.744  0.643  0.643
  1       2     0.964  0.833  0.661  0.661  0.661  0.524  0.554  0.357  0.369
  2       1     0.929  0.911  0.702  0.708  0.452  0.470  0.554  0.399  0.399
  2       2     0.765  0.756  0.524  0.286  0.500  0.482  0.482  0.583  0.476
  2       3     0.613  0.821  0.821  0.714  0.744  0.738  0.732  0.804  0.798
  3       1     0.858  0.657  0.639  0.509  0.633  0.609  0.533  0.568  0.586
  3       2     0.592  0.432  0.438  0.456  0.456  0.604  0.639  0.544  0.533
  4       1     0.916  0.940  0.725  0.689  0.533  0.533  0.401  0.467  0.467
  4       2     1.000  0.641  0.641  0.641  0.461  0.461  0.551  0.617  0.605
  5       1     0.615  0.538  0.544  0.527  0.509  0.669  0.657  0.663  0.675
  5       2     0.609  0.396  0.391  0.391  0.556  0.604  0.556  0.651  0.639
  6       1     0.970  0.707  0.719  0.551  0.527  0.587  0.611  0.557  0.563
  6       2     0.569  0.575  0.389  0.587  0.593  0.581  0.479  0.473  0.533
  6       3     0.665  0.563  0.671  0.599  0.539  0.581  0.539  0.551  0.599
  7       1     0.994  0.756  0.583  0.417  0.429  0.429  0.417  0.393  0.399
  7       2     0.881  0.821  0.607  0.500  0.518  0.435  0.536  0.548  0.548
  7       3     0.827  0.536  0.524  0.571  0.494  0.494  0.589  0.565  0.530
  8       1     0.798  0.512  0.464  0.494  0.589  0.423  0.393  0.446  0.440
  8       2     0.583  0.679  0.571  0.601  0.565  0.589  0.417  0.417  0.458
  9       1     0.515  0.544  0.562  0.373  0.527  0.462  0.485  0.497  0.544
  9       2     0.621  0.538  0.538  0.580  0.592  0.621  0.627  0.633  0.716

For Recognition Patterns

        Contingency Table Ratio for Given Number of Classes
Digit     2      3      4      5      6      7      8      9      10

  0     0.807  0.578  0.590  0.434  0.434  0.416  0.440  0.434  0.349
  1     0.750  0.607  0.619  0.530  0.435  0.363  0.387  0.375  0.369
  2     0.500  0.500  0.482  0.494  0.542  0.536  0.512  0.524  0.411
  3     0.633  0.556  0.408  0.604  0.544  0.491  0.509  0.456  0.479
  4     0.772  0.790  0.790  0.515  0.491  0.491  0.497  0.491  0.497
  5     0.858  0.775  0.763  0.728  0.604  0.633  0.633  0.491  0.491
  6     0.778  0.707  0.605  0.617  0.593  0.587  0.599  0.587  0.689
  7     0.494  0.821  0.750  0.565  0.583  0.583  0.595  0.560  0.583
  8     0.554  0.464  0.631  0.619  0.518  0.524  0.548  0.464  0.548
  9     0.491  0.538  0.556  0.574  0.396  0.361  0.320  0.408  0.432

*Refer to Subsection IV.D.3.
TABLE 8. CONTINGENCY TABLE OF THE FIRST KIND*
RESULTS FOR POSTITERATIVE OPTIMIZATION

For Scanning Patterns

        Ref     Contingency Table Ratio for Given Number of Classes
Digit   Point     2      3      4      5      6      7      8      9      10

  0       1     0.633  0.964  0.657  0.795  0.699  0.627  0.633  0.651  0.735
  0       2     1.000  1.000  0.723  0.807  0.590  0.596  0.596  0.741  0.753
  0       3     1.000  0.669  1.000  0.614  0.687  0.699  0.633  0.669  0.657
  1       1     1.000  0.857  1.000  0.863  0.631  0.458  0.518  0.732  0.827
  1       2     0.958  0.815  0.554  0.702  0.548  0.708  0.655  0.708  0.702
  2       1     0.917  0.815  0.625  0.708  0.708  0.685  0.655  0.571  0.571
  2       2     1.000  1.000  0.637  0.726  0.857  0.810  0.738  0.732  0.714
  2       3     1.000  0.696  1.000  0.571  0.756  0.917  0.869  0.935  0.899
  3       1     0.675  0.775  0.462  0.799  0.917  0.663  0.568  0.663  0.669
  3       2     0.988  0.633  0.970  0.716  0.686  0.716  0.734  0.710  0.704
  4       1     1.000  1.000  0.856  0.826  0.587  0.689  0.695  0.766  0.641
  4       2     1.000  0.808  0.665  0.665  0.605  0.641  0.719  0.743  0.635
  5       1     0.675  0.503  0.923  0.799  0.893  0.716  0.462  0.792  0.586
  5       2     1.000  0.964  0.959  0.846  0.710  0.941  0.822  0.675  0.675
  6       1     1.000  1.000  0.844  0.707  0.880  0.850  0.665  0.599  0.635
  6       2     1.000  0.647  0.850  0.737  0.814  0.665  0.575  0.713  0.713
  6       3     1.000  0.467  0.689  1.000  0.886  0.629  0.701  1.000  0.904
  7       1     1.000  0.988  0.500  0.929  0.750  0.601  0.720  0.655  0.726
  7       2     1.000  0.577  0.762  0.875  0.964  0.821  0.702  0.607  0.667
  7       3     0.488  0.988  0.982  0.601  0.613  0.714  0.661  0.798  0.756
  8       1     1.000  0.839  0.940  0.685  0.488  0.470  0.607  0.690  0.661
  8       2     0.952  0.685  0.637  0.673  0.679  0.571  0.851  0.714  0.690
  9       1     1.000  0.781  0.680  0.645  0.787  0.775  0.686  0.746  0.917
  9       2     0.692  0.757  0.840  1.000  0.746  0.716  0.799  0.805  0.870

For Recognition Patterns

        Contingency Table Ratio for Given Number of Classes
Digit     2      3      4      5      6      7      8      9      10

  0     1.000  0.753  0.789  0.590  0.639  0.861  0.783  0.753  0.614
  1     0.940  0.738  0.940  0.780  0.655  0.661  0.792  0.827  0.685
  2     1.000  0.964  0.798  0.851  0.708  0.744  0.696  0.565  0.607
  3     1.000  0.645  0.686  0.562  0.556  0.515  0.562  0.562  0.592
  4     0.988  0.916  0.491  0.617  0.563  0.743  0.587  0.563  0.647
  5     1.000  0.834  0.728  0.817  0.609  0.580  0.734  0.757  0.852
  6     1.000  0.982  0.743  0.796  0.832  0.689  0.796  0.629  0.593
  7     1.000  0.929  0.726  0.530  0.679  0.554  0.714  0.655  0.601
  8     1.000  1.000  1.000  0.958  0.839  0.679  0.685  0.565  0.726
  9     1.000  0.793  1.000  0.935  0.609  0.669  0.645  0.651  0.669

*Refer to Subsection IV.D.3.
Figure 28. opost/opre Versus Opost/apre for MINAVE Agglomerative Clustering of Scanning Patterns

Figure 29. Upost /oprc Versus ftpoji/ ftpn* for MINMAX
Agglomerative Clustering of Scanning Patterns

Figure 30. a Versus a for Preiterative Optimization of
MINAVE Agglomerative Clusters of Scanning Patterns

Figure 31. a Versus a for Preiterative Optimization of
MINMAX Agglomerative Clusters of Scanning Patterns


("H" is used in this subsection to agree with the information theory literature. The "E" used
in the last subsection is reserved for the normalized entropy, i.e., H/log2 c.)
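The entropy and its normalized form can be illustrated with a short calculation. The following is a minimal sketch, not code from the report: the function names and the two example class-size lists are hypothetical, and p(i) is estimated as the fraction of samples falling in class i.

```python
import math


def entropy(class_sizes):
    """Entropy H(c), in bits, of a partition described by its class sizes."""
    n = sum(class_sizes)
    # Empty classes contribute nothing (p log p -> 0 as p -> 0).
    return -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0)


def normalized_entropy(class_sizes):
    """Normalized entropy E = H / log2(c), where c is the number of classes."""
    c = len(class_sizes)
    if c < 2:
        return 0.0  # a single class carries no uncertainty
    return entropy(class_sizes) / math.log2(c)


# Hypothetical partitions of 168 samples into four classes.
even = [42, 42, 42, 42]    # evenly dispersed class sizes
skewed = [165, 1, 1, 1]    # one dominant class plus three singletons

print("even:  ", entropy(even), normalized_entropy(even))
print("skewed:", entropy(skewed), normalized_entropy(skewed))
```

An evenly split partition gives a normalized entropy of 1, while a partition dominated by one class gives a value near 0, which is why E serves as a measure of dispersion in class sizes.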


The average information about the class, given knowledge of the sex, is the conditional
entropy:

$$H(c|s) = -p(m) \sum_{i=1}^{C} p(i|m) \log p(i|m) - p(f) \sum_{i=1}^{C} p(i|f) \log p(i|f) \tag{22}$$
Figure 32. a Versus a for Postiterative Optimization of
MINAVE Agglomerative Clusters of Scanning Patterns


The average information gained from the clustering is then reduced by the a priori
information represented in the classes, yielding what information theorists term I(c;s), the
average of the mutual information between the class and the sex. The value for I(c;s) is given by

$$I(c;s) = H(c) - H(c|s) = -\sum_{i=1}^{C} p(i) \log p(i) + p(m) \sum_{i=1}^{C} p(i|m) \log p(i|m) + p(f) \sum_{i=1}^{C} p(i|f) \log p(i|f) \tag{23}$$

Note that since H(c) - H(c|s) = H(s) - H(s|c), the term I(c;s) can also be written as:

$$I(c;s) = -p(m) \log p(m) - p(f) \log p(f) + \sum_{i=1}^{C} p(i)\, p(m|i) \log p(m|i) + \sum_{i=1}^{C} p(i)\, p(f|i) \log p(f|i) \tag{24}$$

Note also that I(c;s) = H(c) for the two-class case when a population with an equal number of
males and females is divided into an all-male class and an all-female class, since the class is
uniquely determined by knowing the sex. This corresponds to what is called a "noise-free
channel" in information theory. In contrast, when no information is transmitted through a
channel, H(c) = H(c|s), yielding I(c;s) = 0, which corresponds to the case of each class having
an equal number of males and females.
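The identity H(c) - H(c|s) = H(s) - H(s|c) and the two limiting cases can be checked numerically. The sketch below is illustrative only: the helper names and the male/female counts per class are hypothetical, and all probabilities are estimated as relative frequencies.

```python
import math


def H(probs):
    """Entropy in bits of a discrete distribution; zero terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_both_ways(males, females):
    """Return (H(c) - H(c|s), H(s) - H(s|c)).

    males[i] and females[i] are the counts of each sex in class i.
    Both sexes are assumed to be present in the population.
    """
    n_m, n_f = sum(males), sum(females)
    n = n_m + n_f
    sizes = [m + f for m, f in zip(males, females)]

    h_c = H([k / n for k in sizes])
    h_c_given_s = ((n_m / n) * H([m / n_m for m in males])
                   + (n_f / n) * H([f / n_f for f in females]))

    h_s = H([n_m / n, n_f / n])
    h_s_given_c = sum((k / n) * H([m / k, f / k])
                      for k, m, f in zip(sizes, males, females) if k > 0)

    return h_c - h_c_given_s, h_s - h_s_given_c


# Noise-free case: an all-male class and an all-female class.
print(mutual_information_both_ways([50, 0], [0, 50]))
# No information: each class holds equal numbers of males and females.
print(mutual_information_both_ways([25, 25], [25, 25]))
# An intermediate, partly mixed partition.
print(mutual_information_both_ways([40, 10], [5, 45]))
```

In each case the two returned values agree, and they range from H(c) for the perfectly separated partition down to zero for the fully mixed one.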

However, I(c;s) is used here as a measure of the "goodness" of the clustering relative to
the condition (sex in this case) tested. The estimates used for the various probabilities are
given in terms of

$n$        Total number of samples

$n_i$      Number of samples in class i

$n_m$      Number of males

$n_f$      Number of females

$n_i^m$    Number of males in class i

$n_i^f$    Number of females in class i

The given probabilities are estimated as
$p(i) = n_i / n$
$p(m) = n_m / n$
$p(f) = n_f / n$
$p(i|m) = n_i^m / n_m$
$p(i|f) = n_i^f / n_f$
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TABLE  18  MARTIN  HERSCHER  DIGIT  TEXTS 


$p(m|i) = n_i^m / n_i$
$p(f|i) = n_i^f / n_i$

Hence, I(c;s) is calculated by

$$I(c;s) = -\sum_{i=1}^{C} \frac{n_i}{n} \log \frac{n_i}{n} + \frac{n_m}{n} \sum_{i=1}^{C} \frac{n_i^m}{n_m} \log \frac{n_i^m}{n_m} + \frac{n_f}{n} \sum_{i=1}^{C} \frac{n_i^f}{n_f} \log \frac{n_i^f}{n_f} \tag{25}$$
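Equation (25) can be evaluated directly from the per-class counts of males and females. The following is a minimal sketch under stated assumptions: the function name and the example counts are hypothetical, and terms with a zero count are taken as zero (the limit of x log x as x goes to 0).

```python
import math


def plogp(x):
    """x * log2(x), with the conventional value 0 at x = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def mutual_information_from_counts(males, females):
    """I(c;s) in bits, following the three sums of equation (25).

    males[i] and females[i] are the numbers of males and females in class i.
    Both sexes are assumed to be present in the population.
    """
    n_m, n_f = sum(males), sum(females)
    n = n_m + n_f
    class_term = -sum(plogp((m + f) / n) for m, f in zip(males, females))
    male_term = (n_m / n) * sum(plogp(m / n_m) for m in males)
    female_term = (n_f / n) * sum(plogp(f / n_f) for f in females)
    return class_term + male_term + female_term


# Hypothetical three-class partition of 64 males and 42 females.
males = [58, 4, 2]
females = [3, 30, 9]
print(mutual_information_from_counts(males, females))
```

Because only counts are needed, the same routine applies to any binary attribute known a priori for each data point, not just the sex of the speaker.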


Values of I are given in Tables 9 through 12 for scanning patterns and in Tables 13
through 16 for recognition patterns, along with another measure, R, the distribution of males
and females among the classes, for the reader with a less esoteric inclination. The value for R
is given by

$$R = \frac{1}{n} \sum_{i=1}^{C} \min\left(n_i^f, n_i^m\right) \tag{26}$$


which is a measure of the residue in each of the classes. The tables in each of the two sets are
arranged by clustering algorithm in the following order:

(1) MINAVE agglomerative clustering; preiterative optimization

(2) MINAVE agglomerative clustering; postiterative optimization

(3) MINMAX agglomerative clustering; preiterative optimization

(4) MINMAX agglomerative clustering; postiterative optimization.
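The residue R of equation (26) can be sketched in a few lines. This is an illustration rather than the report's code: the function name and example counts are hypothetical, and the division by the total sample count n is an assumption, inferred from the tabulated R values, which never exceed 0.5.

```python
def residue(males, females):
    """Residue R: minority-sex members summed over classes, as a fraction of n.

    males[i] and females[i] are the numbers of males and females in class i.
    Normalizing by n is an assumption made for this sketch.
    """
    n = sum(males) + sum(females)
    return sum(min(m, f) for m, f in zip(males, females)) / n


# Hypothetical three-class partition of 64 males and 42 females.
print(residue([58, 4, 2], [3, 30, 9]))
```

R is zero when every class contains one sex only and reaches 0.5 when every class is evenly mixed, so low R and high I both indicate agreement with the male/female division.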

The information in Tables 9 through 12 is summarized in Figure 33 with histograms of I for
each of cases 1 through 4 above. Likewise, the information in Tables 13 through 16 is
summarized in Figure 34. It is clear from these two figures that iterative optimization to
minimize Jt improves the resulting clusters, assuming the male/female distinction is valid (and
from an acoustic-phonetic standpoint, it is). In addition, these two figures show that MINMAX
agglomerative clustering yields better clusters before the iterative optimization than does
MINAVE. However, no clear preference results between the MINAVE and MINMAX clusters
after iterative optimization, showing that the iterative optimization algorithm was robust
enough to produce good clusters from either the MINAVE or MINMAX agglomerative clustering,
even though the starting partition produced from the MINAVE agglomeration was clearly inferior.
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TABLE  19  SYNOPSIS  OF  EVALUATION  RESULTS 

Percent 


TABLE 9. MUTUAL INFORMATION AND RESIDUES FOR PREITERATIVE OPTIMIZATION
OF MINAVE AGGLOMERATIVE CLUSTERS OF SCANNING PATTERNS

DG/RP     C:    2      3      4      5      6      7      8      9      10

0  1   I:     0.030  0.091  0.119  0.179  0.195  0.217  0.266  0.269  0.289
0  1   R:     0.434  0.410  0.410  0.355  0.355  0.349  0.259  0.259  0.259
0  2   I:     0.006  0.024  0.036  0.038  0.372  0.372  0.395  0.401  0.466
0  2   R:     0.494  0.462  0.482  0.470  0.167  0.187  0.175  0.175  0.139
0  3   I:     0.052  0.093  0.103  0.103  0.125  0.125  0.130  0.143  0.143
0  3   R:     0.392  0.373  0.373  0.373  0.355  0.355  0.355  0.355  0.355
1  1   I:     0.006  0.014  0.015  0.034  0.043  0.049  0.062  0.075  0.075
1  1   R:     0.468  0.466  0.482  0.435  0.435  0.435  0.435  0.423  0.423
1  2   I:     0.006  0.012  0.082  0.086  0.066  0.086  0.097  0.097  0.124
1  2   R:     0.486  0.486  0.423  0.411  0.411  0.411  0.411  0.411  0.387
2  1   I:     0.005  0.006  0.026  0.032  0.038  0.039  0.058  0.064  0.071
2  1   R:     0.488  0.482  0.476  0.470  0.464  0.458  0.411  0.411  0.405
2  2   I:     0.018  0.018  0.018  0.025  0.714  0.755  0.755  0.757  0.794
2  2   R:     0.482  0.482  0.482  0.476  0.060  0.042  0.042  0.042  0.036
2  3   I:     0.007  0.025  0.026  0.030  0.042  0.062  0.062  0.069  0.069
2  3   R:     0.452  0.429  0.429  0.429  0.429  0.411  0.411  0.411  0.411
3  1   I:     0.000  0.026  0.035  0.040  0.046  0.050  0.054  0.079  0.081
3  1   R:     0.497  0.456  0.450  0.444  0.420  0.420  0.420  0.406  0.402
3  2   I:     0.000  0.007  0.046  0.046  0.074  0.154  0.154  0.162  0.165
3  2   R:     0.497  0.491  0.438  0.438  0.432  0.320  0.320  0.314  0.314
4  1   I:     0.006  0.049  0.061  0.087  0.118  0.124  0.126  0.133  0.138
4  1   R:     0.497  0.455  0.455  0.443  0.413  0.413  0.413  0.389  0.389
4  2   I:     0.006  0.016  0.018  0.124  0.140  0.146  0.208  0.265  0.320
4  2   R:     0.497  0.485  0.485  0.377  0.347  0.347  0.335  0.269  0.251
5  1   I:     0.055  0.146  0.218  0.222  0.230  0.236  0.236  0.245  0.245
5  1   R:     0.432  0.355  0.302  0.302  0.302  0.302  0.302  0.302  0.302
5  2   I:     0.000  0.006  0.043  0.049  0.296  0.314  0.332  0.356  0.358
5  2   R:     0.497  0.491  0.456  0.456  0.201  0.201  0.201  0.201  0.201
6  1   I:     0.206  0.230  0.230  0.268  0.272  0.620  0.621  0.642  0.659
6  1   R:     0.281  0.275  0.275  0.275  0.275  0.076  0.078  0.072  0.066
6  2   I:     0.006  0.012  0.018  0.633  0.634  0.634  0.714  0.714  0.720
6  2   R:     0.497  0.491  0.485  0.072  0.072  0.072  0.072  0.072  0.072
6  3   I:     0.005  0.017  0.051  0.178  0.185  0.207  0.207  0.248  0.248
6  3   R:     0.491  0.479  0.419  0.317  0.317  0.317  0.317  0.287  0.287
7  1   I:     0.023  0.029  0.570  0.598  0.620  0.623  0.623  0.641  0.648
7  1   R:     0.486  0.462  0.069  0.089  0.089  0.089  0.089  0.089  0.083
7  2   I:     0.006  0.591  0.596  0.601  0.603  0.604  0.604  0.604  0.604
7  2   R:     0.488  0.063  0.083  0.063  0.083  0.083  0.083  0.083  0.083
7  3   I:     0.006  0.031  0.036  0.081  0.101  0.101  0.120  0.120  0.141
7  3   R:     0.468  0.464  0.452  0.452  0.446  0.446  0.399  0.399  0.399
8  1   I:     0.041  0.104  [illegible]  [illegible]  0.405  0.421  0.475  0.476  0.478
8  1   R:     0.435  0.361  [illegible]  [illegible]  0.161  0.161  0.149  0.149  0.149
8  2   I:     0.006  0.082  0.103  [illegible]  0.126  0.126  0.126  0.126  0.144
8  2   R:     0.488  0.405  0.405  [illegible]  0.393  0.393  0.393  0.393  0.357
9  1   I:     0.023  0.046  0.050  0.050  0.270  0.285  0.306  0.334  0.353
9  1   R:     0.485  0.456  0.456  0.456  0.219  0.219  0.207  0.207  0.207
9  2   I:     0.012  0.021  0.032  0.034  0.040  0.082  0.085  0.099  0.247
9  2   R:     0.491  0.473  0.473  0.467  0.467  0.402  0.402  0.396  0.296

TABLE  10.  MUTUAL  INFORMATION  AND  RESIDUES  FOR  POSTITERATIVE  OPTIMIZATION 
OF  MINAVE  AGGLOMERATIVE  CLUSTERS  OF  SCANNING  PATTERNS 


DG/RP  C:  2  3  4  5  6  7  8  9  10

0  1 I!  0.039  0.575  0.387  0.529  0.568  0.552  0.598  0.621  0.631 

0  1 R:  0.922  0.096  0.211  0.151  0.096  0.195  0.119  0.108  0.102 

0  2 I!  0.389  0.997  0.909  0.977  0.993  0.560  0.556  0.©03  0.578 

0  2 R:  0.157  0.181  0.217  0.139  0.151  0.157  0.195  0.151  0.139 

0 3 IS  0.099  0.067  0.177  0.295  0.28b  0.215  0.257  0.269  0.27b 

0 3 R:  0.392  0.36b  0.2b5  0.253  0.253  0.307  0.271  0.301  0.289 

1 1 IS  0.023  0.015  0.090  0.096  0.059  0.096  0.051  0.079  0.102 

1  1 RS  0.9U  0.929  0.905  0.399  0.387  0.393  0.393  0.363  0.381 

1  2 IS  0.323  0.285  0.955  0.928  0.92o  0.382  0.969  0.901  0.916 

1 2 RS  0.190  0.226  0.185  0.190  0.179  0.185  0.202  0.219  0.c02 

2 1 IS  0.075  0.063  0.089  0.080  0.129  0.126  0.132  0.139  0.168 

2  1 RS  0.357  0.375  0.351  0.363  0.310  0.315  0.309  0.309  0.280 

2  2 IS  0.679  0.658  0.5b9  0.705  0.712  0.759  0.732  0.665  0.709 

2  2 RS  0.060  0.077  0.101  0.060  0.083  0.083  0.069  0.107  0.089 

2 3 IS  0.013  0.013  0.019  0.069  0.056  0.069  0.065  0.113  0.155 

2 3 RS  0.935  0.935  0.996  0.357  0.375  0.363  0.369  0.395  0.333 

3 l IS  0.009  0.029  0.095  0.079  0.057  0.131  0.180  0.169  0.158 

3  1 RS  0.999  0.902  0.902  0.355  0.373  0.337  O.308  0.325  0.302 

3  2 IS  0.058  0.062  0.103  0.159  0.191  0.157  0.181  0.281  0.289 

3  2 RS  0.361  0.361  0.337  0.302  0.320  0.320  0.314  0.293  0.237 

9 1 IS  0.001  0.036  0.306  0.29b  0.309  0.266  0.258  0.902  0.355 

9 1 RS  0.985  0.901  0.228  0.216  0.198  0.239  0.2b9  0.192  0.239 

9 2 IS  0.220  0.391  0.393  0.25b  0.316  0.392  0.395  0.3b3  0.935 

9 2 RS  0.239  0.198  0.198  0.249  0.239  0.2b3  0.180  0.209  0.166 

5  1 IS  0.523  0.935  0.397  0.522  0.610  0.617  0.951  0.557  0.535 

5  1 RS  0.112  0.136  0.166  0.130  0.118  0.130  0.189  0.192  0.130 

5  2 IS  0.935  0.381  0,916  0.329  0.929  0.378  0.929  0.900  0.988 

5 2 RS  0.13b  0.172  0.178  0.231  0.169  0.189  0.195  0.189  0.183 

6 1 IS  0.235  0.951  0.926  0.619  0.600  0.636  0.723  0.671  0.722 

6  1 RS  0.269  0.162  0.15b  0.078  0.089  0.u7b  0.066  0.069  0.072 

6  2 IS  0.799  0.761  0.799  0.790  0.668  0.633  0.732  0.603  0.729 

6  2 RS  0.092  0.042  0.036  0.098  0.090  0.136  0.108  0.144  0.090 

6 3 IS  0.177  0.236  0.259  0.346  0.304  0.413  0.413  0.345  0.343 

6 3 RS  0.269  0.257  0.263  0.18b  0.210  0.192  0.174  0.198  0.198 

7 1 IS  0.732  0.599  0.701  0.683  0.728  0.655  0.662  0.650  0.764 

7  1 RS  0.048  0.089  0.065  0.077  0.065  0.089  0.083  0.101  0.065 

7  2 IS  0.607  0.685  0.573  0.720  0.680  0.717  0.709  0.736  0.680 

7  2 RS  0.077  0.071  0.095  0.065  0.095  0.071  0.071  0.077  0.101 

7 3 IS  0.014  0.335  0.346  0.313  0.375  0.425  0.404  0.388  0.445 

7 3 RS  0.435  0.238  0.220  0.214  0.232  0.206  0.226  0.220  0.226 

8 1 IS  0.139  0.281  0.183  0.300  0.299  0.322  0.4bl  0.417  0.427 

8  1 RS  0.286  0.214  0.28b  0.202  0.208  0.196  0.167  0.173  0.190 

6 2 IS  0.334  0.238  0.269  0.575  0.392  0.526  0.445  0.496  0.527 

6 2 RS  0.185  0.250  0.220  0.137  0.208  0.167  0.250  0.165  0.196 


9 1 IS  0.494  0.515  0.484  0.456  0.544  0.514  0.557  0.566  0.566 
9 1 RS  0.112  0.154  0.225  0.207  0.142  0.154  0.136  0.124  0.124 


TABLE 11. MUTUAL INFORMATION AND RESIDUES FOR PREITERATIVE OPTIMIZATION
OF MINMAX AGGLOMERATIVE CLUSTERS OF SCANNING PATTERNS


OG/RP 

Ci  2 

3 

4 

5 

6 

7 

6 

9 

10 

0 

1 

i: 

0.222 

0.356 

0.398 

0.398 

0.399 

0.423 

0.423 

0.438 

0.440 

0 

1 

r: 

0.23b 

0.169 

0.169 

0.169 

0.169 

0.  lbV 

0.164 

0.169 

0.169 

0 

2 

I : 

0.022 

0.105 

0.133 

0.373 

0.390 

0.453 

0.461 

0.461 

0.466 

0 

2 

r: 

0.416 

0.380 

0.380 

0.205 

0.199 

0.163 

0.163 

0 . 1 o3 

0.157 

0 

3 

i: 

0.052 

0.060 

0.072 

0.079 

0.165 

0.182 

0.277 

0.291 

0.291 

0 

3 

r: 

0.392 

0.392 

0.386 

0.38b 

0.271 

0.271 

0.235 

0.235 

0.235 

1 

1 

i: 

0.009 

0.017 

0.032 

0.035 

0.092 

0.119 

0.120 

0.130 

0.134 

1 

1 

R: 

0.464 

0.429 

0.417 

0.417 

0.393 

0.393 

0.393 

0.393 

0.393 

1 

2 

It 

0.031 

0.032 

0.155 

0.158 

0.159 

0.214 

0.223 

0.525 

0.525 

1 

2 

r: 

0.458 

0.458 

0.310 

0.310 

0.304 

0.304 

0.304 

0.149 

0.149 

2 

1 

It 

0.005 

0.006 

0.065 

0.080 

0.088 

0.104 

0.104 

0.104 

0.126 

2 

1 

R: 

0.476 

0.476 

0.363 

0.357 

0.357 

0.357 

0.357 

0.357 

0.345 

2 

2 

i: 

0.005 

0.005 

0.343 

0.455 

0.493 

0.516 

0.522 

0.526 

0.528 

2 

2 

k: 

0.464 

0.464 

0.238 

0.238 

0.196 

0.165 

0.174 

0.179 

0.179 

2 

3 

i: 

0.002 

0.006 

0.033 

0.034 

0.041 

0.054 

0.056 

0.057 

0.063 

2 

3 

R: 

0.462 

0.458 

0.429 

0.429 

0.429 

0.417 

0.411 

0.411 

0.411 

3 

1 

i: 

0.004 

0.078 

0.090 

0.173 

0.174 

0.203 

0.204 

0.204 

0.204 

3 

1 

R: 

0.467 

0.367 

0.349 

0.306 

0.308 

0.308 

0.3U6 

0.306 

0 . 308 

3 

2 

It 

0.093 

0.133 

0.169 

0.169 

0.199 

O.207 

0.212 

0.221 

0.230 

3 

2 

R: 

0.331 

0.331 

0.272 

0.272 

0.272 

0.272 

0.272 

0.272 

0.272 

4 

1 

It 

0.000 

0.026 

0.186 

0.239 

0.311 

0.350 

0.361 

0.363 

0.370 

4 

1 

R : 

0.497 

0.473 

0.293 

0.246 

0.210 

0.196 

0.196 

0.198 

0.198 

4 

2 

It 

0.006 

0.013 

0.062 

0.084 

0.099 

0.130 

0.173 

0.405 

0.420 

4 

2 

R: 

0.497 

0.449 

0.395 

0.395 

0.359 

0.359 

0.329 

0.216 

0.216 

5 

1 

It 

0.001 

0.083 

0.144 

0.167 

0.304 

0.340 

0.362 

0.385 

0.418 

5 

1 

R: 

0.479 

0.396 

0.361 

0.320 

0.254 

0.231 

0.219 

0.219 

0.219 

5 

2 

It 

0.065 

0.266 

0.290 

0.375 

0.387 

0.414 

0.470 

0.483 

0.483 

5 

2 

Rt 

0.355 

0.237 

0.237 

0.189 

0.189 

0.169 

0.169 

0.189 

0.189 

1 

It 

0.250 

8:31? 

0.352 

0.556 

0.558 

0.569 

0.569 

0.570 

0.592 

1 

Rt 

0.251 

0.246 

0. 096 

0.096 

0.096 

0.096 

0.096 

0.090 

2 

It 

0.550 

0.556 

0.619 

0.621 

0.622 

0.691 

0.697 

0.712 

0.715 

2 

Rt 

0.102 

0.102 

0.102 

0.102 

0.102 

0.072 

0.072 

0.072 

0.072 

i Hi 

h\\\ 

0.169 

0.201 

0.212 

0.230 

0.246 

0.234 

0.240 

0.247 

0.265 

0.269 

0.246 

0.246 

0.246 

0.246 

0.246 

0.228 

7 

i 

It 

0.017 

0.185 

0.213 

0.260 

0.285 

0.322 

0.323 

0.444 

0.455 

7 

i 

Rt 

0.488 

0.298 

0.286 

0.266 

0.266 

0.262 

0.262 

0.190 

0.190 

Hi 

0.008 

0.523 

0.524 

0.534 

0.562 

0.610 

0.631 

0.655 

0.655 

0.458 

0.119 

0.119 

0.119 

0.119 

0.119 

0.119 

0.119 

0.119 

3 

It 

0.198 

0.246 

0.251 

0.251 

0.262 

0.266 

0.266 

0.279 

0.295 

3 

Rt 

0.321 

0.321 

0.321 

0.321 

0.292 

0.292 

0.292 

0.292 

0.292 

1 

It 

0.081 

0.129 

0.136 

0.142 

0.143 

0.191 

0.261 

0.267 

0.287 

1 

Rt 

0.351 

0.304 

0.304 

0.304 

0.304 

6 . 3o4 

0.268 

0.268 

0.262 

2 

It 

0.095 

8:383 

0.217 

0.224 

0.224 

0.231 

0.487 

0.496 

0.505 

2 

Rt 

0.327 

0.260 

0.280 

0.260 

0.260 

0.165 

0.179 

0.179 

1 

I> 

0.220 

0.231 

0.236 

0.231 

0.271 

0.380 

0.396 

0.523 

0.548 

0.548 

0.553 

1 

Rt 

0.231 

0.231 

0.231 

0. 160 

0.146 

0.148 

0.148 

2 

It 

0.059 

Mil 

0.080 

8:11? 

0.265 

0.265 

0.265 

0.294 

0.326 

2 

Rt 

0.361 

0.361 

0.272 

0.272 

0.272 

0.272 

0.249 

r.x 
TABLE 12. MUTUAL INFORMATION AND RESIDUES FOR POSTITERATIVE OPTIMIZATION
OF MINMAX AGGLOMERATIVE CLUSTERS OF SCANNING PATTERNS


OG /HP 

Cs  2 

3 

4 

5 

6 

7 

6 

9 

10 

0 

1 

i: 

0.574 

0.666 

0.609 

0.631 

0.630 

0.627 

0.654 

0.659 

0.665 

0 

1 

r: 

0.090 

0.066 

0.078 

0.078 

0.096 

0.096 

0.09b 

0.096 

0.096 

0 

2 

I : 

0.389 

0.447 

0.411 

0.492 

0.534 

0.542 

0.53b 

0.604 

0.645 

0 

2 

R : 

0.157 

0.181 

0.187 

0.169 

0.199 

0.157 

0.157 

0.151 

0.120 

0 

3 

I: 

0.049 

0.133 

0.177 

0.147 

0.191 

0.217 

0.287 

0.353 

0.403 

0 

3 

R: 

0.392 

0.289 

0.265 

0.337 

0.265 

0.253 

0.235 

0.199 

0.169 

1 

1 

I: 

0.023 

0.035 

0.040 

0.065 

0.087 

0.166 

0.135 

0.1  30 

0.137 

1 

1 

Hi 

0.411 

0.393 

0.405 

0.375 

0.357 

0.315 

0.345 

0.345 

0.339 

1 

2 

I: 

0.328 

0.290 

0.459 

0.443 

0.367 

0.398 

0.339 

0.452 

0.424 

1 

2 

r: 

0.196 

0.226 

0.185 

0.202 

0.226 

0.202 

0.268 

0.19b 

0.19b 

2 

1 

It 

0.053 

0.077 

0.062 

0.096 

0.061 

0.066 

0.190 

0.148 

0.176 

2 

1 

R: 

0.393 

0.363 

0.375 

0.369 

0.387 

0.367 

0.28b 

0.315 

0.292 

2 

2 

i: 

0.674 

0.658 

0.711 

0.617 

0.684 

0.792 

0.696 

0.709 

0.751 

2 

2 

r: 

0.060 

0.077 

0.095 

0.119 

0.101 

0.077 

0.101 

0.083 

0.071 

2 

3 

I: 

0.013 

0.019 

0.014 

0.021 

0.077 

0.093 

0.09b 

0.128 

0.145 

2 

3 

ft: 

0.435 

0.429 

0.446 

0.435 

0.387 

0.345 

0.345 

0.327 

0.327 

1 

IS 

0.002 

0.012 

0.023 

0.076 

0.066 

0.100 

0.082 

0.074 

0.143 

1 

r: 

0.473 

0.444 

0.426 

0.379 

0.367 

0.355 

0.373 

0 . 3b  1 

0.343 

2 

I: 

0.049 

0.108 

0.093 

0.095 

0.095 

0.158 

0.175 

0.179 

0.214 

2 

R: 

0.373 

0.325 

0.349 

0.349 

0.343 

0.314 

0.302 

0.302 

0.276 

1 

IS 

0.001 

0.036 

0.287 

0.305 

0.265 

0.255 

0.298 

0.294 

0.316 

1 

ft: 

0.485 

0.401 

0.222 

0.216 

0.269 

0.275 

0.257 

0 . 2b9 

0.263 

2 

i: 

0.220 

0.212 

0.251 

0.263 

0.286 

0.315 

0.347 

0.329 

0.367 

0.376 

2 

ft : 

0.234 

0.263 

0.216 

0.257 

0.222 

0.228 

0.192 

0.198 

5 

1 

IS 

0.044 

0.414 

0.457 

0.540 

0.627 

0.552 

0.54b 

0.589 

0.554 

5 

1 

ft: 

0.379 

0.178 

0.148 

0.124 

0.083 

0.130 

0.124 

0.116 

0.118 

5 

2 

I: 

0.435 

0.408 

0.393 

0.380 

0.439 

0.364 

0.397 

0.532 

0.536 

5 

2 

ft: 

0.136 

0.166 

0.183 

0.183 

0.148 

0.169 

0.207 

0.154 

0.154 

1 

is 

0.235 

0.451 

0.619 

0.625 

0.612 

0.611 

0.600 

0.558 

0.607 

1 

ft: 

0.269 

0.162 

0.078 

0.078 

0.076 

0.076 

0.084 

0.096 

0.064 

2 

I : 

0.749 

0.754 

0.762 

0.789 

0.694 

0.676 

0.705 

0.760 

0.790 

2 

ft: 

0.042 

0.042 

0.042 

0.048 

0.066 

0.072 

0.066 

0.054 

0.048 

3 

i: 

0.177 

0.194 

0.285 

0.346 

0.390 

0.381 

0.314 

0.344 

0.368 

3 

r: 

0.269 

0.251 

0.210 

0.196 

0.174 

0.18b 

0.216 

0.198 

0.192 

1 

I: 

0.732 

0.601 

0.624 

0.611 

0.488 

0.496 

0.544 

0.781 

0.776 

1 

Hi 

0.048 

0.089 

0.089 

0.089 

0.149 

0.149 

0.143 

0.048 

0.054 

2 

IS 

0.607 

0.487 

0.661 

0.665 

0.684 

0.684 

0.747 

0.723 

8:21$ 

2 

ftt 

0.077 

0.131 

0.077 

0.095 

0.095 

0.095 

0.095 

0.089 

3 

IS 

0.446 

0.332 

0.347 

0.363 

0.410 

0.401 

0.428 

0.435 

0.490 

3 

Rt 

0.149 

0.238 

0.220 

0.214 

0.185 

0.185 

0.179 

0.214 

0.161 

1 

IS 

0.139 

0.185 

0.236 

0.246 

0.235 

0.215 

0.274 

0.291 

0.349 

1 

RS 

0.286 

0.274 

0.274 

0.266 

0.260 

0.266 

0.250 

0.250 

0.226 

2 

IS 

0.312 

0.452 

0.366 

0.441 

0.519 

0.450 

0.542 

0.575 

0.593 

2 

RS 

0.196 

0.190 

0.244 

0.196 

0.155 

0.196 

0.179 

0.155 

0.149 

1 

IS 

0.494 

0.449 

0.534 

0.551 

0.547 

0.512 

0.563 

0.567 

0.585 

1 

ft: 

0.112 

0.130 

0.136 

0.136 

0.136 

0.142 

0.118 

0.118 

0.112 

2 

IS 

0.110 

0.030 

0.240 

0.292 

0.268 

0.270 

0.291 

0.302 

0.367 

2 

R : 

0.320 

0.426 

0.278 

0.249 

0.264 

0.284 

0.276 

0.243 

0.201 
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TABLE  21.  T-FUNCTION  QUANTIZATION  THRESHOLDS 


TABLE 13. MUTUAL INFORMATION AND RESIDUES FOR PREITERATIVE OPTIMIZATION
OF MINAVE AGGLOMERATIVE CLUSTERS OF RECOGNITION PATTERNS

DG/RP     C:    2      3      4      5      6      7      8      9      10

0  0   I:     0.024  0.068  0.066  0.088  0.095  0.095  0.095  0.095  0.693
0  0   R:     0.482  0.422  0.422  0.422  0.410  0.410  0.410  0.410  0.066
1  0   I:     0.403  0.403  0.447  0.448  0.470  0.464  0.497  0.500  0.500
1  0   R:     0.167  0.167  0.143  0.143  0.137  0.131  0.131  0.131  0.131
2  0   I:     0.024  0.043  0.049  0.061  0.727  0.729  0.729  0.740  0.741
2  0   R:     0.476  0.458  0.458  0.446  0.060  0.060  0.060  0.060  0.060
3  0   I:     0.006  0.014  0.023  0.541  0.567  0.566  0.572  0.575  0.577
3  0   R:     0.497  0.479  0.462  0.101  0.101  0.101  0.101  0.101  0.101
4  0   I:     0.006  0.016  0.024  0.197  0.196  0.196  0.235  0.251  0.276
4  0   R:     0.497  0.485  0.479  0.311  0.311  0.311  0.305  0.287  0.269
5  0   I:     0.661  0.706  0.736  0.752  0.767  0.766  0.768  0.784  0.797
5  0   R:     0.065  0.059  0.053  0.047  0.047  0.047  0.047  0.041  0.036
6  0   I:     0.012  0.631  0.632  0.634  0.637  0.637  0.640  0.640  0.666
6  0   R:     0.491  0.078  0.078  0.076  0.076  0.078  0.076  0.076  0.076
7  0   I:     0.023  0.789  0.790  0.791  0.791  0.793  0.793  0.614  0.614
7  0   R:     0.466  0.042  0.042  0.042  0.042  0.042  0.042  0.042  0.042
8  0   I:     0.006  0.023  0.872  0.873  0.873  0.874  0.874  0.875  0.861
8  0   R:     0.466  0.486  0.018  0.018  0.016  0.016  0.016  0.016  0.016
9  0   I:     0.031  0.119  0.128  0.126  0.136  0.170  0.170  0.391  0.393
9  0   R:     0.462  0.385  0.365  0.385  0.367  0.367  0.367  0.163  0.163


TABLE 14. MUTUAL INFORMATION AND RESIDUES FOR POSTITERATIVE OPTIMIZATION
OF MINAVE AGGLOMERATIVE CLUSTERS OF RECOGNITION PATTERNS

DG/RP     C:    2      3      4      5      6      7      8      9      10

0  0   I:     0.787  0.678  0.685  0.668  0.656  0.810  0.743  0.696  0.766
0  0   R:     0.042  0.060  0.060  0.066  0.072  0.042  0.054  0.060  0.054
1  0   I:     0.341  0.318  0.434  0.447  0.446  0.375  0.560  0.501  0.490
1  0   R:     0.173  0.202  0.167  0.179  0.155  0.179  0.107  0.143  0.149
2  0   I:     0.626  0.515  0.670  0.676  0.659  0.631  0.768  0.717  0.600
2  0   R:     0.083  0.131  0.083  0.071  0.077  0.125  0.054  0.077  0.054
3  0   I:     0.401  0.347  0.404  0.463  0.421  0.561  0.612  0.623  0.627
3  0   R:     0.154  0.219  0.148  0.160  0.160  0.124  0.130  0.146  0.142
4  0   I:     0.151  0.165  0.213  0.296  0.398  0.396  0.322  0.325  0.432
4  0   R:     0.261  0.311  0.311  0.234  0.160  0.204  0.240  0.240  0.204
5  0   I:     0.549  0.583  0.829  0.795  0.706  0.724  0.725  0.700  0.711
5  0   R:     0.095  0.142  0.063  0.089  0.118  0.083  0.065  0.063  0.065
6  0   I:     0.841  0.769  0.682  0.728  0.734  0.739  0.785  0.811  0.813
6  0   R:     0.024  0.046  0.084  0.054  0.060  0.054  0.042  0.054  0.054
7  0   I:     0.953  0.829  0.819  0.766  0.646  0.852  0.933  0.785  0.791
7  0   R:     0.006  0.030  0.077  0.065  0.036  0.036  0.012  0.071  0.065
8  0   I:     0.703  0.797  0.640  0.640  0.832  0.764  0.795  0.602  0.887
8  0   R:     0.054  0.036  0.030  0.030  0.042  0.063  0.077  0.077  0.024
9  0   I:     0.554  0.642  0.505  0.725  0.750  0.636  0.664  0.739  0.763
9  0   R:     0.095  0.077  0.154  0.059  0.069  0.124  0.112  0.071  0.077


TABLE 15. MUTUAL INFORMATION AND RESIDUES FOR PREITERATIVE OPTIMIZATION
OF MINMAX AGGLOMERATIVE CLUSTERS OF RECOGNITION PATTERNS


DG  RP  C:       2      3      4      5      6      7      8      9     10

0   0   I:   0.212  0.707  0.711  0.716  0.721  0.721  0.724  0.740  0.776
0   0   R:   0.301  0.054  0.054  0.054  0.054  0.054  0.054  0.054  0.054
1   0   I:   0.176  0.222  0.349  0.355  0.355  0.396  0.433  0.433  0.446
1   0   R:   0.262  0.262  0.202  0.202  0.202  0.202  0.179  0.179  0.179
2   0   I:   0.094  0.226  0.496  0.525  0.549  0.549  0.549  0.549  0.549
2   0   R:   0.321  0.250  0.149  0.149  0.149  0.149  0.149  0.149  0.149
3   0   I:   0.325  0.326  0.356  0.407  0.428  0.473  0.510  0.536  0.538
3   0   R:   0.195  0.195  0.195  0.183  0.183  0.176  0.130  0.130  0.130
4   0   I:   0.162  0.170  0.204  0.205  0.205  0.230  0.250  0.250  0.250
4   0   R:   0.317  0.311  0.311  0.311  0.311  0.311  0.293  0.293  0.293
5   0   I:   0.387  0.585  0.782  0.783  0.783  0.797  0.797  0.804  0.804
5   0   R:   0.160  0.124  0.041  0.041  0.041  0.041  0.041  0.041  0.041
6   0   I:   0.204  0.655  0.674  0.686  0.692  0.692  0.692  0.713  0.713
6   0   R:   0.305  0.066  0.066  0.066  0.066  0.066  0.066  0.066  0.066
7   0   I:   0.953  0.963  0.963  0.963  0.969  0.969  0.969  0.969  0.969
7   0   R:   0.006  0.006  0.006  0.006  0.006  0.006  0.006  0.006  0.006
8   0   I:   0.558  0.606  0.707  0.707  0.707  0.730  0.736  0.736  0.738
8   0   R:   0.095  0.095  0.095  0.095  0.095  0.069  0.069  0.089  0.089
9   0   I:   0.369  0.451  0.453  0.487  0.502  0.502  0.574  0.582  0.595
9   0   R:   0.160  0.160  0.160  0.130  0.130  0.130  0.130  0.130  0.130

TABLE 16. MUTUAL INFORMATION AND RESIDUES FOR POSTITERATIVE OPTIMIZATION
OF MINMAX AGGLOMERATIVE CLUSTERS OF RECOGNITION PATTERNS


DG  RP  C:       2      3      4      5      6      7      8      9     10

0   0   I:   0.787  0.689  0.678  0.726  0.720  0.764  0.709  0.719  0.734
0   0   R:   0.042  0.060  0.060  0.054  0.060  0.054  0.060  0.066  0.066
1   0   I:   0.310  0.409  0.419  0.362  0.396  0.523  0.596  0.515  0.516
1   0   R:   0.165  0.179  0.167  0.202  0.185  0.149  0.125  0.131  0.137
2   0   I:   0.628  0.466  0.611  0.710  0.690  0.664  0.675  0.795  0.789
2   0   R:   0.063  0.149  0.101  0.065  0.077  0.095  0.083  0.054  0.048
3   0   I:   0.401  0.443  0.473  0.579  0.602  0.602  0.557  0.587  0.620
3   0   R:   0.154  0.136  0.154  0.118  0.107  0.107  0.116  0.124  0.130
4   0   I:   0.153  0.176  0.252  0.239  0.294  0.325  0.370  0.446  0.416
4   0   R:   0.281  0.293  0.234  0.281  0.210  0.234  0.196  0.156  0.174
5   0   I:   0.549  0.735  0.670  0.693  0.757  0.742  0.804  0.805  0.793
5   0   R:   0.095  0.053  0.065  0.077  0.047  0.053  0.041  0.041  0.047
6   0   I:   0.841  0.770  0.764  0.745  0.730  0.736  0.616  0.829  0.612
6   0   R:   0.024  0.048  0.042  0.048  0.054  0.054  0.036  0.048  0.048
7   0   I:   0.953  0.911  0.628  0.887  0.868  0.861  0.669  0.936  0.935
7   0   R:   0.006  0.012  0.030  0.030  0.030  0.024  0.036  0.012  0.012
8   0   I:   0.703  0.797  0.840  0.840  0.800  0.623  0.634  0.828  0.878
8   0   R:   0.054  0.036  0.030  0.030  0.042  0.036  0.042  0.042  0.030
9   0   I:   0.554  0.494  0.505  0.700  0.568  0.519  0.645  0.631  0.704
9   0   R:   0.095  0.148  0.154  0.065  0.107  0.124  0.095  0.124  0.083
Figure 34. Comparison of Mutual Information for Clustered
Recognition Patterns Using Four Clustering Algorithms

SECTION V

GENERAL  PURPOSE  SPEECH  I/O  CAPABILITY 


This section describes the methods used in the AP-120B version of the digit recognition
program. Topics covered include the data collection hardware, the filter simulation, the
autocorrelation computation, and a discussion of digitizing and playback utilities.

A.  SYSTEM  DESCRIPTION 


The primary impetus for designing this remote speech I/O facility was to relieve the host
of the burden of controlling analog-to-digital (A/D) and digital-to-analog (D/A) converters. A
second consideration was to develop a method of getting data into an array processor at a high
rate, to allow real-time data collection and processing. Figure 35 is a block diagram of the
resulting speech I/O subsystem.


The configuration consists of a TI 980B host computer, a Floating Point Systems AP-120B
array processor, and a TI 990/10 computer with attached A/D and D/A converters. The 990/10
collects and plays out data under the control of a "mailbox" memory location in the AP-120B.
Therefore, the 990/10 can be controlled by either the 980B or the AP-120B. The AP-120B is
used primarily to reduce the quantity of data by transforming the raw speech to a more compact
form (e.g., filtering or preprocessing). In a typical application, the host would request data from
the 990/10, request the AP-120B to process the data, and then request that the results of that
processing be sent to the 980B.

Software directly used in the I/O subsystem consisted of two parts. The first is the 990/10
software that controls the A/D and D/A, buffers the input and output speech, and controls
speech I/O to the AP-120B. The second piece of software runs on the 980B and is basically a
device driver for the 990/10. This driver handles the channel protocol as well as all I/O between
the 980B and the AP-120B. In addition, existing software that digitizes and edits speech data
using the 980B internal A/D and D/A was modified to use the new data acquisition subsystem.

The offloading of the host was carried one step further in the digit-recognition program. In
order to free the host from controlling the 990/10 and AP-120B, the AP-120B was put in charge
of this entire process. Since the AP-120B has direct memory access (DMA) capability to the
host, the host need only tell the AP-120B where to place the processed data and when to begin
collection. This scheme allows continuous input of speech, since host activity is fully decoupled
from the data collection and processing. With all this computational burden removed from the
host, it can easily keep pace with real-time processing.

Figure 35. Speech Channel Block Diagram

B. FILTER SIMULATION IN THE AP-120B

The first process initiated on the input speech data is a recursive filter simulation. The
center frequencies and bandwidths of this simulation are designed to match those of the
hardware filters specified in Subsection II.A and Appendix A. The filter model included both
preemphasis and envelope shaping to match the hardware filters. These 16 filters were typically
sampled every 10 milliseconds. This filter simulation accounts for approximately 60 percent of
real time when data are collected at an 80-microsecond sample interval (12.5 kHz).
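
As a rough illustration of what one channel of a recursive filter simulation can look like, the sketch below implements a generic two-pole resonator whose pole angle and radius are set from a center frequency and a bandwidth. This is an assumption for illustration only: the actual filter design, preemphasis, and envelope shaping are those of Subsection II.A and Appendix A and are not reproduced here, and the names `resonator` and `band_energy` are hypothetical.

```python
import math


def resonator(samples, center_hz, bandwidth_hz, sample_rate_hz=12500.0):
    """Generic two-pole recursive band-pass filter (illustrative, not the report's design).

    y[n] = x[n] + 2 r cos(theta) y[n-1] - r^2 y[n-2]
    """
    # Pole radius from the bandwidth, pole angle from the center frequency.
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * center_hz / sample_rate_hz
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def band_energy(samples):
    """Sum of squared samples, a crude stand-in for a channel energy measurement."""
    return sum(v * v for v in samples)


# An 80-microsecond sample interval corresponds to 12,500 samples per second,
# so one 10-millisecond output frame spans 125 input samples.
SAMPLES_PER_FRAME = round(0.010 / 80e-6)
```

Feeding a 1000-Hz tone through a channel centered at 1000 Hz yields far more output energy than feeding it through a channel centered at 3000 Hz, which is the band-selective behavior a filter bank relies on.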

The  output  of  the  filter  simulator  is  then  preprocessed,  and  the  output  of  the  preprocessor 
is  sent  to  the  980B  host  memory.  The  preprocessing  is  the  same  as  that  described  in  Appendix 
A. 

C. VOICING DECISION FROM THE AP-120B AUTOCORRELATION PITCH TRACKER

An estimate of voicing was included in this speech input subsystem by performing an
autocorrelation on the input speech. This consisted of sliding a window of speech over previous
speech. Given a frame of speech data consisting of N samples, the last M samples of the frame,
Wr, are used as a sliding window for a reverse correlation. The normalized cross-correlation,
Rr(K), between the sliding window, Wr, and the M speech samples earlier in time starting at the
(M + Kmin)th sample is used for this reverse correlation; i.e.,


R_r(K) = \frac{\sum_{m=1}^{M} X(M-m+1)\, X(M-m-K+1)}
              {\sqrt{\sum_{m=1}^{M} X^2(M-m+1) \; \sum_{m=1}^{M} X^2(M-m-K+1)}},
\qquad K_{\min} \le K \le K_{\max}                                        (28)


The maximum value of Rr(K) over K is then selected as the voicing indicator. An |Rr(K)max| less
than about 0.6 indicates an unvoiced frame, while an |Rr(K)max| approaching 1 indicates a
strongly voiced frame. The value of K for the Rr(K)max was not used, although it corresponds
to the value of the pitch period in samples. Typical values used in these calculations are:

M = 375 (375 samples at 80 microseconds = 30 milliseconds)

N = 125 (125 samples at 80 microseconds = 10 milliseconds)

Kmin = 25 (25 samples at 80 microseconds = 2 milliseconds)

Kmax = 250 (250 samples at 80 microseconds = 20 milliseconds)
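
The reverse correlation described above can be sketched in a few lines. The code below is a minimal illustration, assuming the normalization is the square root of the product of the two window energies and that the input buffer holds at least M + Kmax samples of history; the function name `voicing_indicator` and the buffer layout are assumptions, not the AP-120B implementation.

```python
import math


def voicing_indicator(x, m, k_min, k_max):
    """Peak normalized correlation between the last m samples of x and the
    m samples that end k samples earlier, for k_min <= k <= k_max (k_min >= 1)."""
    if k_min < 1 or len(x) < m + k_max:
        raise ValueError("need k_min >= 1 and at least m + k_max samples of history")
    window = x[-m:]                          # sliding window Wr
    window_energy = sum(v * v for v in window)
    best = 0.0
    for k in range(k_min, k_max + 1):
        earlier = x[-m - k:-k]               # same-length segment, k samples earlier
        cross = sum(a * b for a, b in zip(window, earlier))
        norm = math.sqrt(window_energy * sum(v * v for v in earlier))
        r = cross / norm if norm > 0.0 else 0.0
        best = max(best, r)
    return best


# Typical values from the text: M = 375, Kmin = 25, Kmax = 250 (80-microsecond samples).
periodic = [math.sin(2.0 * math.pi * n / 100.0) for n in range(625)]
indicator = voicing_indicator(periodic, m=375, k_min=25, k_max=250)
is_voiced = indicator >= 0.6                 # threshold of about 0.6 from the text
```

A strictly periodic input gives an indicator very close to 1, while uncorrelated noise stays well under the 0.6 threshold quoted in the text.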


SECTION  VI 

EXPERIMENTAL  RESULTS 


A.  SPEAKER-INDEPENDENT  DIGIT  RECOGNITION 

1.  Data  Sets 

A wide  variety  of  digit-recognition  testing  was  done  using  two  types  of  data  sets.  The  most 
heavily  used  data  set  was  the  total-voice  evaluation  data  set.  This  test  data  set  was  part  of  a data 
base  collected  in  a sound  booth  over  a 3-month  period  at  Texas  Instruments.  The  test  data  set 
was  extracted  from  around  the  middle  of  this  3-month  period,  to  avoid  the  initial  microphone 
fright  of  the  subjects,  and  consisted  of  one  repetition  of  one  of  10  possible  sets  of  10  six-digit 
sequences uttered by 106 subjects (64 males, 42 females). The actual sequences used in this data
collection are shown in Table 17. The test data were digitized, edited, and preprocessed during
the  total  voice  contract  to  ensure  the  precise  replicability  of  the  test  data.  However,  since  these 
“test”  data  were  used  for  multiple  experiments  to  evaluate  the  effect  of  parameter  variations,  the 
validity  of  the  absolute  recognition  results  is  not  assured.  In  addition,  the  data  were  also 
idealized  by  editing,  which  avoids  spurious  false  recognition  of  background  noises  as  true  data. 
For  these  reasons,  further  experiments  were  performed  on  a second  data  base. 

The  second  data  base  used  is  a subset  of  a large  digit-recognition  data  base  currently  being 
collected  in  the  speech  community.  The  data  being  collected  use  sequences  devised  by  Martin 
and  Herscher,  modified  to  include  both  the  “oh”  and  the  “zero”  pronunciations  of  the  digit 
zero.  The  texts  being  used  in  these  data  collections  are  shown  in  Table  18.  All  the  multiple-digit 
sequences  are  supposed  to  be  said  in  a continuous  manner,  although  not  all  subjects  always 
complied. 

2.  Digit-Recognition  Results  for  Six-Digit  Sequences 

A total  of  29  evaluation  runs  were  made  on  the  1,060  six-digit  sequences  from  the  total 
voice  evaluation  data  set.  The  overall  digit  recognition  rates  and  conditions  for  all  runs  are  given 
in  Table  19.  Even  though  no  syntactic  constraints  (except  length)  were  applied  during  the  digit 
recognition,  the  total  voice  evaluation  data  set  used  was  compatible  with  the  design  data  since 
the following digit pairs, all nasal-to-vowel, glide, or semivowel (or vice versa) transitions, were
disallowed  in  both  data  sets: 


    0 1    1 8    2 9    3 9    4 9    9 1
    0 8    2 1    3 1    4 1    7 1    9 8
    0 9    2 8    3 8    4 8    X 1

The first of these evaluation runs (no. 45) was the syntactically unconstrained digit
recognition for the final evaluation for the total voice study. A detailed description of the
thresholds and parameters for the evaluation runs is given in the total voice final report. The
values of most of these parameters remained unchanged during this current study, except as
noted in this section. These parameters are listed below along with the values used for run no.
45:


TABLE 17. THE 10 SETS OF 10 SIX-DIGIT
SEQUENCES USED IN TESTING


057342    072358    027683    068513    061934
124063    145867    176840    165327    159034
273064    237945    243057    261754    253760
358206    361907    361745    346810    368405
451960    458973    468153    457063    430752
546207    510264    510426    520463    517943
612703    613270    675904    654372    675823
720364    724513    724351    759403    794302
869512    879045    879630    853706    852734
945703    946513    932750    942607    926034

035162    026954    047658    057423    026873
152374    162573    162735    142760    132057
206457    265903    269305    234768    234687
347620    367514    345172    345687    345768
453076    463275    423579    403685    403251
540276    570369    516890    576804    547602
658793    694583    694230    619473    651942
759403    758946    740258    750619    790234
851762    869350    861023    851924    879630
974581    951672    968027    968450    958170

Reference-point location parameters

    Peak-to-valley ratio (PVR) = 1.10
    Maximum valley point error (Max VPE) = 615

OPTSEQ (valley point sequencing parameters)

    dt limits (dtmax, dtmin) and expected dt (dt): see Table 11 of total voice final report¹
    Minimum expected dt (used to determine dt* for the denominator in the
    point-pair error calculation): dt* = max (dt, dtmin) = 4
    Time deviation weighting (0) = 2
    Floor of valley point error (OFFSET) = 100

Hypothesized digit parameters

    Minimum absolute average energy across recognition pattern (ENmin) = 150
    Weighting of sequence error (SQ) contribution to total normalized error for
    digit k (wk) = 0.1


TABLE 18. MARTIN-HERSCHER DIGIT TEXTS


Isolated Digits

    9  6  2  4  1  7  ZERO  3  5  8  OH
    9  6  2  4  1  7  ZERO  3  5  8  OH

Five-Digit Codes

(Pronounce "0" as ZERO.)

    08175    10260    55806    67438    44953    32146
    29091    60733    68630    91625    81754    79241

(Pronounce "0" as OH.)

    08175    10260    55806    29091    60733    68630

Three-Digit Codes

(Pronounce "0" as ZERO.)

     (1) 525    (11) 990    (21) 631    (31) 005    (41) 033
     (2) 759    (12) 583    (22) 349    (32) 140    (42) 477
     (3) 101    (13) 171    (23) 565    (33) 819    (43) 680
     (4) 626    (14) 098    (24) 113    (34) 974    (44) 306
     (5) 202    (15) 232    (25) 460    (35) 357    (45) 915
     (6) 727    (16) 670    (26) 892    (36) 212    (46) 782
     (7) 366    (17) 854    (27) 964    (37) 551    (47) 248
     (8) 044    (18) 386    (28) 076    (38) 161    (48) 887
     (9) 843    (19) 795    (29) 228    (39) 453    (49) 939
    (10) 418    (20) 429    (30) 737    (40) 508    (50) 694

(Pronounce "0" as OH.)

    101    990    460    005    033
    202    098    076    140    680
    044    670    508    306

Normalizers to account for expected recognition error for digit k (TEk): values given
later in this section.

Maximum allowable total normalized error for digit k (NEk):

Digit 

0 1 

2 

3 

4 

5 

6 7 8 

9 

I2X  97 

12  3 

no  ii 

6 1 1 

13 

110  109  107 

114 



TABLE 19. SYNOPSIS OF EVALUATION RESULTS


          Percent
          Correct
Run No.   Recognition   Remarks (Changes From Previous Runs)

45        90.5          Baseline (TVBISS final parameters): multiple reference patterns;
                        minimum energy = 150; 0.1 percent NE thresholds; PVR = 1.1;
                        maximum valley point error = 615; sequence length unconstrained.

46        90.0          Same as run no. 45 except new tree-searching subroutine (DECIDE)
                        used with minimum separation = 3 centiseconds; maximum
                        separation = 50 centiseconds.

47        89.4          Same as run no. 46 except sequence length constrained to 6.

48        89.9          Same as run no. 47 except maximum separation = 120 centiseconds
                        and minimum energy = minimum [150, 0.1 (maximum energy of
                        all hypothesized digits)].

49        90.0          Same as run no. 48 except PVR = 1.05 (for this run only).

50        90.3          Same as run no. 48 except point-pair error between reference
                        points 1 and 3 added.

51        89.3          Same as run no. 50 except minimum energy = minimum [150, 0.2
                        (maximum energy of all hypothesized digits)].

52        91.8          Same as run no. 50 except total normalized error (NE) for
                        3 reference-point words multiplied by 0.9 (longer words are,
                        in general, more reliably recognized).

53        91.9          Same as run no. 52 except minimum energy = minimum [150, 0.15
                        (maximum energy of all hypothesized digits)].

54        92.3          Same as run no. 53 except sequence length unconstrained (for
                        this run only).

55        93.8          Same as run no. 53 except 3-bit quantized T-function added to
                        scanning patterns and maximum valley-point error = 860.

56        93.3          Same as run no. 55 except OFFSET in SQ eliminated (for this
                        run only).

57        94.1          Same as run no. 55 except SQ thresholds = 2,000/1,000.

58        94.1          Same as run no. 55 except maximum valley-point error = 700 and
                        SQ thresholds = (800, 330, 860, 400, 340, 400, 820, 700, 420,
                        430) for digits 0 through 9, respectively.

59        93.5          Same as run no. 58 except difference data in scanning patterns
                        eliminated.

60        93.4          Same as run no. 59 except middle column of scanning patterns
                        eliminated.

61        94.1          Same as run no. 58 except with minor bug in subroutine DECIDE
                        corrected.

62        94.2          Same as run no. 61 except with minor adjustments to the TE
                        normalizing constants to account for TE changes due to
                        T-function inclusion.

63        94.1          Same as run no. 61 except with minor changes to the quantization
                        levels for T-function (for this run only).

64        91.9          Same as run no. 62 except T-function quantized to 16 levels;
                        maximum valley-point error = 1,100; PVR = 1.05; SQ thresholds
                        = 2,000/1,000.

65        94.9          Same as run no. 62 except TE normalizers modified to account for
                        confusion-matrix entries from run no. 62.

66, 67    95.1          Same as run no. 62 except TE normalizers modified to account for
68        95.2          confusion-matrix entries from previous run (run nos. 66
69        95.3          through 73).
70-72     95.2
73        95.3


The following additional parameters were introduced for the syntactically unconstrained
digit sequence recognition algorithm added during the current contract.

Minimum (3-centisecond) and maximum (120-centisecond) interdigit times (times
between first reference point of one word and last reference point of previous
word)

Minimum acceptable ratio of average recognition pattern energy for each word to
maximum average recognition pattern energy for all words = 0.15.

Note from Table 19 that poorer results were obtained in experiments that eliminated either the
floor (OFFSET) to the valley-point error (run no. 56) or the difference data from the scanning
patterns (run no. 59).

The digit recognition results for each digit for selected evaluation runs are shown in
Table 20. The evaluation run results shown are only those exhibiting significant improvements
over previous runs. These improvements occurred for run no. 52 because the total normalized
error was lowered for three reference-point words, for run no. 55 because the T-function (see
Section III.B for definition) quantized to 3 bits was included in the scanning patterns, and for
run no. 73 because the normalizers for the recognition error were modified. (Note from Table 19
that quantizing the T-function to 4 bits in run no. 64 degraded performance.)

The  change  made  for  run  no.  52  that  lowered  the  total  normalized  error  for  longer  words 
was  a heuristic  justified  only  by  the  fact  that  longer  words  (those  with  more  reference  points) 
are  less  likely  to  be  spurious  hypotheses.  This  same  philosophy  was  used  in  the  speaker- 
dependent  recognition.  These  heuristic  normalization  constants  (HNCs)  used  were  as  follows: 

No.  of  reference  points  2 3 4 5 6 7 

HNC  1.00  0.90  0.81  0.73  0.66  0.59 
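
The listed constants are consistent with a factor of 0.9 applied once per reference point beyond two, rounded to two decimals. The report does not state this formula, so the sketch below is an observation about the tabulated values rather than the documented derivation; the function name is hypothetical.

```python
def heuristic_normalization_constant(n_reference_points, factor=0.9):
    """HNC value consistent with the table: factor ** (n - 2), rounded to 2 decimals."""
    if n_reference_points < 2:
        raise ValueError("the table starts at 2 reference points")
    return round(factor ** (n_reference_points - 2), 2)


# Reproduce the row of constants for 2 through 7 reference points.
hnc_table = {n: heuristic_normalization_constant(n) for n in range(2, 8)}
```

Under this reading, each additional reference point lowers the total normalized error of a hypothesized word by a further 10 percent.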


TABLE 20. DIGIT RECOGNITION RESULTS FOR
SELECTED EVALUATION RUNS


                  Evaluation Run Number

Digit        45      50      52      55      62      73

0          91.9    91.9    94.3    94.8    95.4    95.4
1          92.6    94.0    92.9    95.2    93.1    92.4
2          76.7    78.1    88.3    92.7    92.7    94.1
3          89.4    89.3    86.7    89.3    89.2    91.9
4          88.6    88.6    88.2    89.7    90.2    97.2
5          98.5    96.2    96.1    97.6    98.1    95.9
6          95.8    95.2    97.8    98.3    98.3    98.0
7          83.7    84.4    89.9    93.8    94.4    98.0
8          97.5    96.8    93.3    97.3    98.5    93.8
9          89.8    89.1    88.0    89.3    91.2    93.7

Overall    90.5    90.3    91.8    93.8    94.2    95.3

The inclusion of the T-function in the scanning pattern was prompted by vowel/nasal
reference points being moved into the nasal rather than being at the phoneme boundary. This
was primarily caused by the inclusion of reference patterns to accommodate nasalized vowels.
Often, even words having non-nasalized vowel-to-nasal transitions produced lower valley-point
errors matching a portion of the nasalized vowel-to-nasal reference pattern. Since the desire
was to favor choosing reference-point candidates at the locations of T-function peaks in the
input, such a bias could be provided by including an inversely quantized value of the T-function
in conveniently unused 4-bit fields in the scanning pattern for the input (Figure 36). Since the
corresponding 4-bit fields of the reference scanning patterns were zero, the inverse quantization
of the T-function meant that the larger T-function values (lower inversely quantized values) that
usually occur at phoneme boundaries would produce lower scanning errors relative to those
produced when the spectral or energy change was not so great during more nearly steady-state
portions of the word. The quantization thresholds given in Table 21 were derived from a
cumulative distribution plot of T-function values at the selected reference points in the digit
recognition design data and at ±1 and ±2 time samples around those points.

In addition to the improved recognition performance shown for run no. 55 in Table 20,
the benefit of including the T-function in the scanning pattern can also be seen in the decrease
in the average recognition pattern error, indicating improved time registration of the input
speech. This decrease is shown in Table 22.

The  third  performance  improvement  was  prompted  by  the  confusion  matrix  for  evaluation 
run  no.  62  (Table  23).  Four  quite  large  nonsymmetrical  substitutions  are  shown  in  the  off- 
diagonal  entries  of  the  confusion  matrix.  Since  the  digits  are  selected  that  minimize  the 
minimum  total  normalized  error  across  the  sequence,  adjustment  of  the  relative  errors  among 
digits  will  affect  the  distribution  of  substitutions  in  the  confusion  matrix.  The  mechanism  for 
performing  this  adjustment  can  be  seen  from  the  following  equation  for  the  total  normalized 
error  for  the  digit  k: 


NE_k = HNC \left[ \frac{TE_k / (\text{no. of columns in digit } k)}{TE_k \text{ normalizing constant}}
       + \frac{w_k}{NPP} \sum PPE \right]


where HNC is the heuristic normalizing constant, TE is the recognition pattern error, NPP is the
number of reference-point pairs, PPE is the point-pair error between two reference points, and w
is a weighting constant for the sum of the PPEs.

The  TEk  normalizing  constants  are  calculated  from  the  expected  values  of  TEk  as  follows: 


TE_k \text{ normalizing constant} =
    \frac{TE_k / (\text{no. of columns in } k)}
         {\frac{1}{10} \sum_{i=0}^{9} TE_i / (\text{no. of columns in } i)}
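
The normalization divides each digit's expected per-column error by the mean of that quantity over the ten digits, so the constants average to 1. The sketch below illustrates this under stated assumptions: the input values are hypothetical placeholders, not data from the report, and the function name is invented for the example. The Source 1 column of Table 24 is included only to check the mean-of-one property.

```python
def te_normalizing_constants(expected_te, n_columns):
    """Per-digit constants: per-column expected TE divided by its mean over all digits."""
    per_column = [te / cols for te, cols in zip(expected_te, n_columns)]
    mean = sum(per_column) / len(per_column)
    return [value / mean for value in per_column]


# Hypothetical expected recognition-pattern errors and pattern widths for digits 0-9.
example_te = [360.0, 440.0, 317.0, 294.0, 222.0, 431.0, 283.0, 423.0, 287.0, 445.0]
example_columns = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
constants = te_normalizing_constants(example_te, example_columns)

# Source 1 constants from Table 24; their mean should be close to 1.
source_1 = [1.039, 1.170, 0.708, 1.085, 0.853, 1.182, 0.705, 1.052, 1.007, 1.200]
source_1_mean = sum(source_1) / len(source_1)
```

A digit whose patterns typically match with larger error therefore receives a constant above 1, which scales its contribution down so that digits compete on an equal footing.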


Five  sets  of  values  for  these  normalizing  constants  are  given  in  Table  24,  derived  from  five 
different  sources: 

(1)  Ee  + Es  + Ea  from  Table  20  of  the  Speaker  Verification  III  report 

(2)  Values  of  Jc  for  the  number  of  reference  patterns  chosen  in  the  total  voice  study 
for  each  digit 

(3)  Values  of  TE  for  each  of  the  digits  in  correct  sequences  in  the  6-digit  sequence 
evaluation  data  set  for  run  no.  33 

(4) Same as source 3, for run no. 57, which includes the T-function in the scanning
patterns

(5)  Values  derived  from  incrementally  changing  the  values  from  source  4 during  run  nos. 
62  through  73. 

Although the normalizing constants derived from source 5 will certainly give somewhat
biased results since they are tuned to the evaluation set, it should be remembered that the test
set is reasonably large (106 speakers). An independent test on the second set of data described in
Subsection VI.A.1 showed that, while not achieving the 19-percent reduction in error rate on the
six-digit sequences between run nos. 62 and 73, a 6-percent reduction in error rate was achieved
using the normalizing constants from source 5 relative to that achieved using those from source 3.
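
The 19-percent figure is a relative reduction in error rate, not a difference in percent correct. A short worked check using the Table 19 values for run nos. 62 and 73 (94.2 and 95.3 percent correct) follows; the function name is illustrative.

```python
def relative_error_reduction(pct_correct_before, pct_correct_after):
    """Percentage by which the error rate fell between two recognition rates."""
    error_before = 100.0 - pct_correct_before
    error_after = 100.0 - pct_correct_after
    return 100.0 * (error_before - error_after) / error_before


# Run no. 62 (94.2 percent) to run no. 73 (95.3 percent): error 5.8 -> 4.7 percent.
tuning_gain = relative_error_reduction(94.2, 95.3)

# Baseline run no. 45 (90.5 percent) to run no. 73: error 9.5 -> 4.7 percent.
overall_gain = relative_error_reduction(90.5, 95.3)
```

The first value rounds to 19 percent; the second is roughly a halving of the error rate over the whole series of runs.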

The confusion matrix for the final evaluation run on this study (no. 73) is shown in Table
25.

One final observation made on the results of the evaluation runs was the usual problem of
poorer performance for females than for males, as shown both by the overall recognition results
in Table 26 and by the histogram of digit recognition performance (Table 27), both from
evaluation run no. 73.

3.  Digit-Recognition  Results  for  Three-Digit  Sequences 

Although  the  data  used  for  the  testing  reported  in  the  last  subsection  were  from  a large 
number  of  subjects  of  different  ages,  races,  dialects,  and  educational  backgrounds,  the  multiple 




TABLE 21. T-FUNCTION QUANTIZATION THRESHOLDS


Quantized Value    Range of T-Function

       0               214-∞
       1               163-213
       2               128-162
       3               101-127
       4                79-100
       5                60-78
       6                40-59
       7                 0-39
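
The inverse quantization in Table 21 maps large T-function values to small codes. A minimal lookup sketch, assuming integer-valued T-function inputs and using only the range boundaries from the table (the function and constant names are hypothetical):

```python
from bisect import bisect_right

# Lower bounds of the Table 21 ranges, ordered from quantized value 7 down to 0.
RANGE_LOWER_BOUNDS = [0, 40, 60, 79, 101, 128, 163, 214]


def inverse_quantize_t(t_value):
    """Return the 3-bit inversely quantized code (0-7) for a T-function value."""
    if t_value < 0:
        raise ValueError("T-function values are assumed to be non-negative")
    # bisect_right counts how many lower bounds are <= t_value (1 through 8).
    return 8 - bisect_right(RANGE_LOWER_BOUNDS, t_value)
```

Because the reference scanning patterns hold zero in this field, a code of 0 (a T-function peak) adds the least to the scanning error and a code of 7 (a steady-state region) adds the most.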


TABLE 22. DECREASE IN AVERAGE RECOGNITION PATTERN
ERROR BY INCLUDING T-FUNCTION IN SCANNING PATTERNS


             Average Recognition Pattern Error

             Run No. 45          Run No. 57
Digit        (No T-Function)     (3-Bit T-Function)

0                359.8               354.0
1                440.4               409.5
2                316.9               311.3
3                294.3               289.0
4                221.5               218.8
5                431.4               424.9
6                283.3               275.2
7                423.4               415.4
8                287.3               288.3
9                444.7               435.3

TABLE 23. CONFUSION MATRIX FOR DIGIT RECOGNITION
FOR 6-DIGIT SEQUENCES FOR RUN NO. 62

Recognized 

0 

1 

2 

3 

4 

S 

6 

7 

8 

9 X 

0 

660 

8 

1 

10 

1 

2 

3 

— 

1 

3 

1 

4 

403 

1 

12 

6 

3 

— 

1 

2 

2 

2 

2 

1 

581 

21 

1 

— 

3 

— 

QZ3 



3 

8 

1 

21 

657 

4 

— 

— 

— 

[461 

1 

4 

1 

13 

— 

4 

644 

[49] 

— 

— 

— 



5 

1 

3 

1 

1 

2 

779 

— 

— 

1 

8 

6 

2 



— 

4 

— 

— 

752 

— 

5 

1 

7 

3 

7 

2 

2 

— 

4 

— 

719 

IT8l 

1 

8 

— 

— 

— 

3 

1 

— 

2 

— 

398 

1 

9 

2 

9 

1 

12 

1 

— 

— 

11 

402  1 

X 

— 

1 

— 

1 

— 

— 

— 

— 

2 



& 


TABLE 24. RECOGNITION ERROR (TE) NORMALIZING CONSTANTS


Digit    Source 1    Source 2    Source 3    Source 4    Source 5

0         1.039       1.022       0.998       1.005       1.020
1         1.170       1.089       1.225       1.163       1.150
2         0.708       0.837       0.878       0.884       0.890
3         1.085       1.016       1.021       1.026       1.020
4         0.853       0.719       0.681       0.691       0.738
5         1.182       1.292       1.195       1.207       1.010
6         0.705       0.870       0.788       0.782       0.730
7         1.052       0.945       0.977       0.983       1.170
8         1.007       1.133       1.008       1.024       0.860
9         1.200       1.076       1.229       1.236       1.330

TABLE 25. CONFUSION MATRIX FOR DIGIT RECOGNITION
FOR 6-DIGIT SEQUENCES FOR RUN NO. 73


Recognized 


0 

1 

2 3 

4 

5 

6 

7 

8 

9 

X 

0 

657 

4 

1 11 

2 

1 

3 

1 



3 

1 

1 

4 

401 

11 

12 

— . — 

— 

1 

— 

4 

1 

2 

2 

1 

589  21 

1 

— 

2 

2 

7 

1 



3 

11 

1 

23  678 

8 

— 

— 

1 

14 

1 



4 

— 

9 

1 

691 

9 

— 

1 

— 

— 



5 

— 

3 



8 

763 

— 

— 

— 

1 

21 

6 

2 

— 

1 7 

2 

— 

749 

1 

1 



1 

7 

2 

2 

1 1 

1 

1 

— 

741 

4 

1 

1 

8 

— 

1 

4 

2 

— 

7 

6 

380 

1 

4 

9 

5 

4 

I 10 

— 

I 

— 

— 

5 

413 

2 

X 

1 

1 1 

— 

— 

— 

— 

2 

— 

TABLE 26. DIGIT-RECOGNITION PERFORMANCE
OF MALES AND FEMALES


             Percent Correct Recognition

Digit        Males      Females

0            97.8        91.6
1            92.8        91.8
2            94.4        93.6
3            94.0        88.2
4            99.8        93.1
5            97.0        94.1
6            99.1        96.7
7            99.2        96.1
8            94.9        92.4
9            95.5        91.4

Average      96.7        93.1





TABLE 27. HISTOGRAM OF DIGIT-RECOGNITION PERFORMANCE


No. of     Percent    Number of    Number of    Number of
Errors     Correct    Subjects     Males        Females

  0         100          22           19            3
  1          98          20           14            6
  2          97          21           13            8
  3          95          11            6            5
  4          93          11            7            4
  5          92           4            1            3
  6          90           5            0            5
  7          88           4            2            2
  8          87           3            1            2
  9          85           0            0            0
 10          83           3            0            3
 11          82           0            0            0
 12          80           0            0            0
 13          78           1            1            0
 14          77           0            0            0
 15          75           1            0            1
≥16         <75           0            0            0
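
The histogram can be cross-checked against the averages in Table 26 and the overall rate in Table 19. The sketch below assumes each subject contributed 60 digits (10 six-digit sequences), which is consistent with the percent-correct column; the variable and function names are illustrative.

```python
# (number of errors, number of males, number of females) from Table 27.
HISTOGRAM = [
    (0, 19, 3), (1, 14, 6), (2, 13, 8), (3, 6, 5), (4, 7, 4), (5, 1, 3),
    (6, 0, 5), (7, 2, 2), (8, 1, 2), (9, 0, 0), (10, 0, 3), (11, 0, 0),
    (12, 0, 0), (13, 1, 0), (14, 0, 0), (15, 0, 1),
]
DIGITS_PER_SUBJECT = 60  # assumed: 10 sequences of 6 digits per subject


def percent_correct(error_counts):
    """Percent correct for a list of (errors_per_subject, n_subjects) pairs."""
    subjects = sum(n for _, n in error_counts)
    errors = sum(e * n for e, n in error_counts)
    return 100.0 * (1.0 - errors / (subjects * DIGITS_PER_SUBJECT))


males = percent_correct([(e, m) for e, m, _ in HISTOGRAM])
females = percent_correct([(e, f) for e, _, f in HISTOGRAM])
overall = percent_correct([(e, m + f) for e, m, f in HISTOGRAM])
```

The recomputed values round to 96.7 percent for males, 93.1 percent for females, and 95.3 percent overall, matching the figures reported for run no. 73.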

evaluations  using  these  same  data  made  a further  independent  test  mandatory.  Results  in  this 
section  use  the  three-digit  sequences  of  the  second  data  base,  excluding  those  with  the  “oh” 
pronunciations.  Results  in  the  next  subsection  are  for  the  digits  said  in  isolation  from  the  same 
data  base. 

Note that, for these 50 sequences, all digits appear an equal number of times and in all
contexts of preceding and following digits. Since the original application of the work done in the
total voice study was for syntactically constrained sequences, not all digit pairs were used in the
design data, as described in the previous subsection. Hence, the recognition performance is
expected to be poorer for digits involved in these transitions. This poorer performance does not,
however, reflect on the method developed for choosing reference patterns, but only on the
inadequacy of the design data for unconstrained digit recognition.

The data used in these tests were collected in sound booths or sound-treated rooms at two
locations: Texas Instruments (Dallas, Texas) and the Institute for Advanced Study of the
Communication Process (Gainesville, Florida).

Since  the  poorer  performance  of  females  has  been  demonstrated  in  the  previous  subsection 
(as  well  as  in  all  other  word-recognition  studies  that  have  been  done,  to  the  best  of  the  author’s 
knowledge),  experiments  were  performed  for  male  subjects  only,  12  from  Dallas  and  11  from 
Gainesville.  The  recognition  performance  is  shown  as  the  far  right  column  in  the  confusion 
matrix  in  Table  28.  The  overall  percent  correct  is  94.0. 

It has been noted in these studies that the confusion matrix can be quite speaker-dependent.
Since, in the 3-digit sequence data base, there were more digits per speaker (150 versus 60) and
fewer subjects (23 versus 106), the confusion matrix entries are more susceptible to high
substitution rates by particular speakers. For example, 23 of the 48 3-for-2 substitutions shown
in Table 28 were caused by two speakers. However, the two digits with the greatest reduction in
recognition rate (2 and 8) from that given in Table 26 for males are two of the four digits
TABLE 28. CONFUSION MATRIX FOR DIGIT RECOGNITION
FOR 3-DIGIT SEQUENCES CONSTRAINED IN LENGTH

Recognized

0  1  2  3  4  5  6  7  8  9  X  Percent Correct

0 

340 

1 

— 

— 



3 

— 

98.8 

1 

1 

331 

7 

3 

8 

— 

95.9 

2

7 

1 

287 

48 



1 

1 

83.2 

3 

3 

3 

5 

327 



1 6 

— 

94.8 

4 

1 

5 

— 

1 

330  6 



1 

95.9 

5

— 

— 

— 

341 

1 

3 

98.8 

6

14 

— 

1 

1 

1 — 323 

3 1 

— — 

94.0 

7 

— 

1 

1 

1 



341  1 

— 

98.8 

8 

— 

2 

1 

29 

1 5 

2 294  10 

1 

85.2 

9 

— 

1 1 

— 

— 

1 

6 326 

1 

94.5 

X 

— 

— 

— 

1 

1 

2 

— 

— 

having  reference  points  located  at  the  word  boundaries.  Hence,  if  the  reference  scanning  patterns 
used  for  locating  these  points  do  not  explicitly  account  for  all  allowable  contexts,  then  these 
reference  points  may  be  missed  during  scanning  because  of  the  lack  of  a significantly  deep  valley 
in  the  scanning  error  (distance  to  the  reference  scanning  pattern).  In  such  a case,  even  though 
the  time-normalized  recognition  pattern  is  much  less  affected  by  context,  this  digit  would  not 
even  be  hypothesized  because  of  the  missing  reference  point. 

Since  the  percent  correct  achieved  for  the  3-digit  connected  digits  was  94.0  using  reference 
patterns  that  did  not  account  for  all  transitions,  it  is  reasonable  to  assume  that  the  96.7-percent 
correct  achieved  for  males  in  the  6-digit  sequences  was  minimally,  if  any  at  all,  caused  by  tuning 
to  the  test  set  during  the  parameter  evaluations  done  on  the  6-digit  sequence  test  set.  It  appears 
reasonable  that  such  a recognition  rate  could  be  achieved  on  the  unconstrained  digit  recognition 
if  the  reference  scanning  pattern  set  were  expanded  to  include  patterns  to  account  for  these 
transitions.  Such  patterns  could  be  generated  with  an  expanded  design  set  using  the  clustering 
techniques  developed  in  these  studies. 

4.  Digit  Recognition  Results  for  Isolated  Digits 

A very limited experiment was run to test the digit recognition on isolated digits. The test
involved two samples of each of the 10 digits said in isolation from the same 23 speakers used in
the 3-digit sequence test. In view of the recognition rates achieved on continuous speech, the
recognition rate of the isolated digits was a surprisingly low 95.4 percent, with over one-third of
the errors being caused by 3-for-2 and 5-for-4 substitutions. More expected would be results such
as those achieved by Martin,² whose error rate was approximately halved between his test T-2
(Philadelphia connected digits) and his test T-3 (Philadelphia isolated digits).

One  possible  explanation  for  the  higher-than-expected  isolated  digit  error  rate  is  that  the 
patterns  contained  in  the  reference  set  to  account  for  contextual  variations  are  generating 
spurious  hypotheses  during  isolated  word  recognition. 

The  specific  confusion  matrix  for  the  isolated  digits  is  given  in  Table  29. 
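
As a check on the figures quoted in this subsection, the short Python sketch below recomputes the per-digit and overall rates from the diagonal of Table 29. The counts are the diagonal entries as they appear in the table, and the 46 trials per digit (two samples from each of 23 speakers) follow from the test description; the variable names are illustrative only.

```python
# Diagonal (correctly recognized) counts for digits 0-9, read from Table 29.
correct_counts = [45, 45, 42, 46, 40, 46, 45, 44, 43, 43]

# Two samples of each digit from each of the 23 speakers.
trials_per_digit = 2 * 23

# Percent correct for each digit, rounded as in the table.
per_digit_percent = [round(100 * c / trials_per_digit, 1) for c in correct_counts]

# Overall percent correct over all 460 isolated-digit trials.
overall_percent = 100 * sum(correct_counts) / (trials_per_digit * len(correct_counts))

print(per_digit_percent)
print(round(overall_percent, 1))
```

The 21 errors implied by these counts include the four 3-for-2 and four 5-for-4 substitutions, which is the "over one-third" noted in the text.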
TABLE 29. CONFUSION MATRIX FOR DIGIT RECOGNITION OF ISOLATED DIGITS

Recognized

0  1  2  3  4  5  6  7  8  9  Percent Correct

0 

45 

1 

— 

— 

— 

97.8 

1 

45 

— 

— 

— 

— 

— 

1 

97.8 

2 

42 

4 

— 

— 

— 

91.3 

3 

46 

— 

— 

100.0 

4 

40 

4 

— 

— 

87.0 

5 

46 

— 

100.0 

6 

1 

45 

97.8 

7 

44 

2 

95.7 

8 

3 

43 

— 

93.5 

9 

1 

— 

-> 

43 

93.5 

5.  Effect  of  Spectral  Normalization  Technique  on 

Digit-Recognition  Performance 

Subsequent to the performance tests described previously in this section, investigation of
the high 3-for-2 substitution rate in the 3-digit sequences revealed a mechanism for improving
recognition results and for making them less susceptible to variations in background noise. This
investigation revealed that the valley point error for reference point 1 for the digit 2 (i.e., the
silence/plosive transition) was higher for the real-time data than for the digitized data.
This was found to be caused, at least in part, by different “silence” spectra. The differences were
alleviated to some extent by increasing $\sigma_{\min}$ in the constant $\sigma^*_j$ used in normalizing the
regressed filter outputs (see discussion in Appendix A):

$$\sigma^*_j = \sigma_{\mathrm{post}_j} + \sigma_{\min} \qquad (31)$$

where $\sigma_{\mathrm{post}_j}$ is the post-regression standard deviation.
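
The effect of the additive floor can be sketched numerically. The Python fragment below is only an illustration under stated assumptions: the post-regression standard deviation is taken as the population standard deviation across the 14 regressed channel values of one frame, and the two frames are made-up numbers (a low-level "silence" frame and the same shape scaled up by 100), not data from this study.

```python
from statistics import pstdev

def normalize(regressed, sigma_min):
    """Divide a regressed frame by sigma* = sigma_post + sigma_min."""
    sigma_post = pstdev(regressed)  # assumed definition of sigma_post
    sigma_star = sigma_post + sigma_min
    return [a / sigma_star for a in regressed]

def peak(frame):
    """Largest absolute value in a frame."""
    return max(abs(a) for a in frame)

def shrink_factor(frame):
    """Ratio of normalized peak with sigma_min = 250 to that with sigma_min = 62."""
    return peak(normalize(frame, 250)) / peak(normalize(frame, 62))

# Hypothetical frames: 14 regressed channel values each.
silence_frame = [12, -8, 5, -15, 9, -4, 7, -11, 3, -6, 10, -2, 6, -6]
speech_frame = [100 * a for a in silence_frame]

print(shrink_factor(silence_frame))  # strongly reduced: the frame is flattened
print(shrink_factor(speech_frame))   # much closer to 1: little change
```

Under these assumptions a larger floor flattens low-energy frames far more than high-energy ones, which is the behavior the text attributes to increasing the floor.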

The specific effects on the spectrum of changing $\sigma_{\min}$ can be seen from the three spectra
shown in Figure 37 for the word two as said by J.S. The increase in the number of hypothesized
digits and the decrease in the total normalized errors for the six digits in the sequences from
which the spectra in Figure 37 were extracted are shown in Table 30. A further breakdown into
the terms that make up the total normalized errors is given in Table 31. Since these tests were
done directly from the analog tape for each trial, exactly repeatable filter outputs are not
obtained. However, general trends can be noted from these tables, such as the over 50-percent
drop in the valley point error for reference point 1 of the digits 5, 2, and 4, which contain
silence or low-energy frication (/f/). An additional point of interest is the consistently lower
recognition errors for the digitized data than for the analog data.

The expected benefits to be derived from using a larger $\sigma_{\min}$ are twofold. First, the
normalized spectrum will tend to be more even for silence or low-energy fricatives, making the
resulting patterns more resistant to variations in background noise. Second, since the reference
TABLE 30. TOTAL NORMALIZED ERROR (NE) FOR DIGITS
FROM SEQUENCE 852 734 FOR SPEAKER J.S.

                     No. of                      Normalized Error
$\sigma_{\min}$      Hypothesized Digits    8     5     2*        7*        3     4
18                   40                     55    55    63(70)    48(54)    59    66
37                   58                     60    49    49(55)    47(53)    64    57
62                   52                     54    44    45(51)    50(56)    52    55
100                  58                     51    41    45(50)    46(52)    58    51
150                  76                     51    38    40(45)    48(54)    52    46
200                  70                     40    39    42(47)    45(51)    50    45
250                  84                     38    38    41(46)    45(51)    47    45
375                  72                     40    33    36(41)    45(50)    43    44
Run No. 53
($\sigma_{\min}$ = 62)   52                 47    38    40(45)    44(49)    46    54

*Since the errors for digits with three reference points are multiplied by 0.9, the unadjusted
errors are given in parentheses.


TABLE 31. VALLEY POINT ERRORS, SEQUENCE ERRORS (SQ), AND
RECOGNITION ERRORS (TE) FOR DIGITS FROM SEQUENCE 852 734
FOR SPEAKER J.S.

                      Digit 8                 Digit 5                    Digit 2
$\sigma_{\min}$     1    2    SQ   TE       1    2    SQ   TE       1    2    3    SQ   TE
18                 389  161  256  335      385  225  312  468      442  330  320  657  418
37                 346  149  216  395      265  218  252  430      308  253  293  474  340
62                 338  137  202  358      264  200  214  389      291  261  268  447  320
100                250  124  152  346      279  167  198  373      255  287  250  404  328
150                324  130  190  331      181  188  158  359      206  244  231  326  297
200                240  105  136  268      214  165  164  368      213  271  231  367  308
250                210  111  126  260      218  163  166  362      181  257  273  358  298
375                212  103  122  275      191  120  124  318      195  247  220  313  273
Run No. 53         303  143  190  309      222  219  200  343      144  261  253  305  308

                      Digit 7                      Digit 3                 Digit 4
$\sigma_{\min}$     1    2    3    SQ   TE       1    2    SQ   TE       1    2    SQ   TE
18                 280  272  148  353  489      406  352  448  306      376  129  214  336
37                 336  267  153  392  468      420  278  404  358      239  129  152  302
62                 352  263  220  487  471      374  320  388  266      193  114  122  298
100                318  255  143  382  462      350  271  356  329      202  108  122  278
150                300  259  125  359  488      314  251  310  296      181   88  102  251
200                291  247  122  324  464      306  238  294  290      189   92  108  248
250                238  252  117  289  478      267  222  252  276      179   97  108  244
375                274  239  108  287  475      273  200  230  257      181  101  110  231
Run No. 53         322  306  160  418  432      307  241  284  267      304  133  182  280
patterns used in speaker-independent digit recognition result from averaging patterns from many
speakers, the reference patterns appear more “washed-out,” lacking the sharp contrasts found in
speaker-specific patterns. Hence, a flatter spectrum on the input speech resulting from using a
larger $\sigma_{\min}$ would probably result in a better match to speaker-independent reference patterns.

This hypothesis was tested using a tape generated at RADC (in the computer room
containing speech-processing equipment) consisting of two repetitions by each of two speakers (R.V.
and J.F.) of the 50 three-digit sequences given in Table IX. The recognition results for all four
repetitions are given in Table 32 using both a $\sigma_{\min}$ of 62 and a $\sigma_{\min}$ of 250. The recognition
results for the larger $\sigma_{\min}$ show a small improvement in the length-constrained case (92.5 to
93.7 percent overall) and essentially no change in the length-unconstrained case (94.2 versus
94.0 percent). A performance improvement would correspondingly be expected on the results
presented in the previous subsections since all these results used data preprocessed using a
$\sigma_{\min}$ of 62.


TABLE 32. $\sigma_{\min}$ VARIATION PERFORMANCE TEST

                        Length Constrained                          Length Unconstrained
Subject    Session    $\sigma_{\min}$ = 62   $\sigma_{\min}$ = 250    $\sigma_{\min}$ = 62   $\sigma_{\min}$ = 250
J.F.       1          96.0                   93.3                     98.0                   95.3
J.F.       2          95.3                   97.3                     96.7                   96.7
R.V.       1          89.3                   90.0                     90.0                   92.0
R.V.       2          91.3                   94.0                     92.0                   92.0
Overall               92.5                   93.7                     94.2                   94.0

In conclusion, an observation made concerning sequence recognition during the testing of
these two subjects should be noted. In an operational system, where sequences can be repeated,
the only consequence of rejected sequences is a decrease in throughput (assuming a sequence can
finally be accepted if repeated); therefore, rejections should not be used in calculation of the
percent correct recognition rate. The percent correct sequence recognition rate is then given by

$$\%\ \text{correct} = 100 \times \frac{\text{no. of correct sequences}}{\text{no. of correct sequences} + \text{no. of incorrect sequences}}$$

or

$$\%\ \text{correct} = 100 \times \frac{\text{no. of correct sequences}}{\text{no. of sequences uttered} - \text{no. of rejected sequences}}$$

In the case where the length is constrained, the present tree-searching algorithm described
in Section II chooses the best sequence of the specified length. However, the sequence
recognition results for the limited testing given in this subsection indicate that by modifying the
algorithm so that the sequence is accepted only if the length of the best sequence is the same as
the specified length, the sequence recognition rate (as defined above) would improve as shown
in the following table:
                                        No. of Utterances
                                   Correct   Rejected   Incorrect   Percent Correct
Best Sequence of Length 3          336       7          57          85.5
If Best Sequence Is of Length 3    330       25         45          88.0
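
The rate defined above can be checked against the counts in this table with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the total of 400 utterances is simply the sum of the three counts in each row.

```python
def percent_correct(correct, incorrect):
    """Percent correct with rejected sequences excluded from the count."""
    return 100.0 * correct / (correct + incorrect)

def percent_correct_from_uttered(correct, uttered, rejected):
    """Equivalent form using the number uttered and the number rejected."""
    return 100.0 * correct / (uttered - rejected)

# (correct, rejected, incorrect) for the two rows of the table.
best_of_length_3 = (336, 7, 57)
only_if_length_3 = (330, 25, 45)

for correct, rejected, incorrect in (best_of_length_3, only_if_length_3):
    uttered = correct + rejected + incorrect
    rate_a = percent_correct(correct, incorrect)
    rate_b = percent_correct_from_uttered(correct, uttered, rejected)
    print(uttered, round(rate_a, 1), round(rate_b, 1))
```

Both forms give the same value for each row because the number uttered minus the number rejected equals the number correct plus the number incorrect.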


B.  LIMITED  VOCABULARY  WORD-RECOGNITION  EXPERIMENT 

Five  speakers  from  the  Speech  Research  Branch  at  Texas  Instruments  were  tested  on  the 
limited  word-recognition  algorithm  using  an  automatic  enrollment  and  a hand  enrollment.  Each 
speaker  was  recorded  onto  analog  tape  while  seated  in  the  sound  booth.  An  enrollment  session 
collected  from  each  speaker  consisted  of  four  discrete  repetitions  of  the  following  words. 


Zero      Five      Minus      Hundred
One       Six       Plus       Thousand
Two       Seven     Point      Enter
Three     Eight     Backup     Erase
Four      Nine      Punch      Display

At  some  time  later  the  same  day  or  the  next  day,  an  execution  session  was  collected  from  each 
speaker.  The  execution  session  consisted  of  a set  of  20  phrases  of  three  randomly  chosen  words 
and  a set  of  20  phrases  of  random  lengths  (up  to  seven  words)  of  randomly  chosen  words.  Each 
phrase  in  the  execution  session  was  spoken  continuously.  Two  of  the  speakers  had  two  execution 
sessions  spaced  a half  day  apart. 

The speakers were then enrolled off-line using both automatic enrollment and hand
enrollment. The execution sessions were then tested against these enrollments. The results of the
experiment are given by the confusion matrices of Tables 33 and 34. The left of the matrix
shows what was said and the top of the matrix shows what was recognized. An entry in the “X”
column means nothing was recognized (a deletion). The entries in the matrix are the number of
times a word was recognized versus what was said. A compilation of the results for each
individual speaker is given in Table 35. One of the speakers (Keith) had two execution sessions,
and his first execution was used for a supervised updating. The second execution session was
used against the updated reference patterns. The results are given in the last column of Table 35.
TABLE 33. CONFUSION MATRIX FOR AUTOMATIC ENROLLMENT


TABLE 35. PERCENT CORRECT SPEAKER-DEPENDENT RECOGNITION
RESULTS FOR CONTINUOUS UTTERANCES FROM A 21-WORD
VOCABULARY ENROLLED ON ISOLATED WORDS

             Automatic      Hand           Updating
Speaker      Enrollment     Enrollment     (Hand)
Gene         77             78
Richard      73             88
Louise       74             87
George       76             94
Keith        66             80
SECTION VII

CONCLUSIONS  AND  RECOMMENDATIONS 


The three major areas of research during this study contract were

(1) High-performance, speaker-independent, connected-digit recognition for syntactically
unconstrained digit sequences

(2) Clustering algorithms for use in the development of sets of reference patterns for
speaker-independent word recognition

(3) Automatic enrollment for speaker-dependent, connected-word recognition for
syntactically unconstrained word sequences.

The program culminated in the installation of the speaker-independent, connected-digit
recognition program on the BISS-ADM speaker verification system at RADC using the total voice
reference patterns for compatibility. In addition, a long-standing hardware failure with the digital
filters on the BISS-ADM system was corrected, resolving performance discrepancies between the
systems at RADC and Texas Instruments.

As part of the three tasks, several developments resulted that are generally applicable to
the speech technology used in this study. The first of these is a modification to the algorithm
for searching the table of hypothesized words (directed graph) that significantly reduces the
processing time. The second development is a technique (transparent to previous programs) for
including a measure of the spectral transitionitivity (T-function) in the scanning patterns for the
purpose of improving the time registration of reference-point locations. The third development is
the capability of digitizing and playing back speech data through A/D and D/A connections to
the fast array processor. This provides the basis for the fourth development, which is simulation
of the digital filters in the array processor, allowing parametric variation of the filter-bank
definition and the consequent ability to perform a variety of tests with data that can be more
precisely replicated using a variety of filter-bank definitions. The new speech channel capability
was also necessary for a fifth development, that of using a quantized autocorrelation value out of
an autocorrelation pitch tracker previously implemented on the array processor to produce a
soft voicing decision for each frame of filtered speech data for eventual incorporation into the
time-normalized recognition pattern. The sixth general development came as a natural extension
to the capabilities provided by the speech channel and filter simulation. This development is the
programming of the preprocessing function in the array processor and subsequent amalgamation
of digitizing, filtering, and preprocessing in the array processor for inputting preprocessed speech
data to the word-recognition programs. This capability reduces the 980B processing time by
about 35 percent, allowing the word recognition algorithm that uses the new directed-graph
searching algorithm to operate sufficiently fast to allow continuous speech input without having
to discontinue sampling after the input of an utterance.

The speaker-independent, connected-digit recognition portion of this study resulted in a
significantly faster algorithm with a 50-percent decrease in error rate over the course of this
study (from 90.5 percent correct recognition to 95.3 percent) on an evaluation data set of ten
6-digit sequences from 106 speakers (64 males, 42 females).
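
As a quick arithmetic check of these figures, the relative decrease in error rate follows directly from the two recognition rates, and the final 95.3 percent is consistent with weighting the male and female averages of Table 26 (96.7 and 93.1 percent) by the number of speakers in each group:

$$\frac{(100 - 90.5) - (100 - 95.3)}{100 - 90.5} = \frac{9.5 - 4.7}{9.5} \approx 0.505$$

$$\frac{64 \times 96.7 + 42 \times 93.1}{106} = \frac{10099.0}{106} \approx 95.3$$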


The development of the clustering algorithm resulted in a two-stage, four-path algorithm
with the mechanisms for detecting outlying data points in the design data and with subsequent
analysis routines for comparing the results from the various paths and testing the validity of
resulting clusters on the basis of comparisons with a priori information about the design data set.
The results of the analysis of the digit-recognition design data set revealed that, although the
clusters selected during the total voice speaker verification contract generally were good
partitions of the data, use of partitions resulting from other paths in the more comprehensive
algorithm would have resulted in somewhat more compact clusters in terms of minimizing the
sum-of-squared error.

The  research  into  development  of  an  automatic  enrollment  technique  for  speaker- 
dependent  word  recognition  resulted  in  a method  that  yielded  very  good  results  for  isolated 
word  recognition  but  less  acceptable  results  when  used  in  continuous  speech  from  the  same 
speaker.  The  better  results  achieved  with  comparable  hand  enrollments  point  to  the  desirability 
of  a semiautomated  enrollment  procedure  allowing  the  operator  the  option  of  modifying 
reference-point  locations  and  recognition-pattern  format  definitions  defined  by  an  automated 
front  end.  Independent  of  the  enrollment  method,  however,  the  benefit  of  reference  file 
updating  as  a means  of  accommodating  contextual  variability,  as  well  as  intersession  variability, 
became  abundantly  clear. 

Throughout all three phases of this study, the general limitation existed of an insufficient
speech data sample rate and spectral resolution of the filter bank, especially in the higher
frequency bands. This limitation must be removed before any further word recognition
development. In addition, although all recognition features up to this point have been spectral
amplitudes or direct correlates thereof (regression coefficients and energy), it is time that more
features are used. This, in fact, was the impetus behind the addition of the “soft” voicing
decision (quantized autocorrelation coefficient) to the spectral parameters derived during
preprocessing.

Care  must  be  taken,  however,  that  none  of  the  new  features  added  are  subject  to 
measurement  errors  sufficient  to  actually  degrade  performance.  In  addition  to  not  degrading 
overall  performance,  new  features  must  not  degrade  performance  of  the  poor  speakers  while 
improving  the  results  for  the  good  speakers. 

However, new features such as autocorrelation values or formant values will require
computation capabilities exceeding those of a 16-bit minicomputer. The recommendation for
future word-recognition development is that such research be done with a computing facility that

(1) Is capable of fast arithmetic both for longer word-length integers and for
floating-point numbers

(2) Contains a large (1/4 to 1/2 million words) primary storage with virtual memory
capability

(3) Contains an operating system with more programmer-directed features than typically
available on 16-bit minicomputers, allowing more of the time now spent on program
development to be spent on speech algorithm development

(4) Contains a fast array processor capable of performing the filter simulation, linear
predictive coefficient (LPC) computations, formant tracking, autocorrelation
computation, etc., necessary for the extended feature set that is required for further
recognition performance improvement.
APPENDIX A
SPEECH  PROCESSING 


The  speech  processing  used  in  this  study  is  based  on  the  relative  spectrum  of  speech  as  a 
function  of  time,  which  is  the  output  of  a 16-channel  digital  filter  bank  that  has  been 
preprocessed  as  described  in  this  appendix. 

1.  FILTER  BANK  DEFINITION 

The spectrum is obtained by processing the speech signal through a digital filter bank
preceded by a first-order differencing network (for preemphasis). The filter bank consists of 16
bandpass filters, each followed by a full-wave rectifier and a four-pole lowpass Bessel filter with a
3-dB cutoff at 30 Hz. Each of the 16 filters is sampled 100 times per second. A block diagram
of the spectral analysis hardware is shown in Figure A-1. Actual filter responses appear in Figure
A-2 for the bandpass filters alone and in Figure A-3 for the bandpass filters with preemphasis.

For processing, the top three filters are summed and filter 14 is replaced by this sum.
Filters 15 and 16 are set to zero. The resulting 14 filter outputs at each time sample are
represented by

$$A_j = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{14,j} \end{bmatrix}$$
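
The channel-merging step just described can be sketched in a few lines of Python. This is an illustration only: the function name and the sample values are hypothetical, and the input is assumed to be the 16 filter outputs for one time sample in ascending filter order.

```python
def merge_top_filters(outputs):
    """Reduce 16 filter outputs to 14 by folding filters 14-16 into filter 14."""
    if len(outputs) != 16:
        raise ValueError("expected 16 filter outputs")
    merged = list(outputs[:13])                             # filters 1-13 unchanged
    merged.append(outputs[13] + outputs[14] + outputs[15])  # new filter 14
    return merged

# Hypothetical outputs for one 10-ms time sample.
frame = [30, 42, 55, 61, 48, 40, 37, 33, 29, 25, 20, 18, 15, 9, 6, 4]
vector = merge_top_filters(frame)
print(len(vector), vector[-1])
```

Dropping filters 15 and 16 after the sum is equivalent to setting them to zero, since only the 14 remaining outputs are carried forward.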

2.  REGRESSION 

It  has  been  found  that,  by  eliminating  the  gross  aspects  of  the  spectrum,  such  as  the  slope 
and  curvature,  more  clearly  defined  formant  frequencies  are  obtained.  Therefore,  the  spectral 
amplitude  vector  is  regressed  by  the  first  three  elements  of  an  orthonormal  basis  set. 
Figure A-1. Spectral Preprocessing Functional Block Diagram

Figure A-2. Digital Filter Responses

Figure A-3. Digital Filter Responses With Preemphasis
Thus, the regression tends to flatten the spectrum, removing any half-cycle sine or cosine wave
trends of the spectrum at time $t_j$. An example of a spectral waveform having a large positive $c_1$
is a nasal, which has one peak near the low end and one near the high end of the spectrum
(around 250 Hz and 2200 Hz). An example of a spectral waveform with a large positive $c_2$ is a
sibilant, having most of its energy above 3000 Hz. Most vowels, however, have the opposite
spectral tilt because of the glottal source spectral decay with increasing frequency, yielding a
large negative value of $c_2$.

3.  NORMALIZATION 


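The regressed amplitude vector is next normalized by a modified postregression standard
deviation, $\sigma^*_j$, for time $t_j$:

$$\sigma^*_j = \sigma_{\mathrm{post}_j} + \sigma_{\min}$$

where $\sigma_{\mathrm{post}_j}$ is the postregression standard deviation (Equation A-3) and $\sigma_{\min} = 62$ for
this study. However, it has been noticed that regression sometimes eliminates too much of the
variance of the filter output vector $A_j$. To limit the regression, a limit is placed on
$\sigma_{\mathrm{post}}$ as follows:

$$\sigma_{\mathrm{post}_j} = \max\left(\sigma_{\mathrm{post}_j},\; R_{\min}\,\sigma_{\mathrm{pre}_j}\right) \qquad \text{(A-4)}$$

where $\sigma_{\mathrm{pre}_j}$ is the corresponding preregression standard deviation and $R_{\min} = 0.6$. Note
that, when $\sigma_{\mathrm{post}_j} = R_{\min}\,\sigma_{\mathrm{pre}_j}$, the regression coefficients $c_1$ and $c_2$ are
reduced in order to decrease the amount of regression. The resulting normalized amplitude vector
is:

$$(A_j)_N = \frac{1}{\sigma^*_j}\,(A_j)_R \qquad \text{(A-5)}$$

The regression coefficients $c_1$ and $c_2$ are also normalized by $\sigma^*_j$.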
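
The limiting step of Equation (A-4) can be sketched as follows. This is a hedged illustration rather than the study's implementation: the function name is hypothetical, the two standard deviations are passed in as already-computed numbers, and only the constants 62 and 0.6 come from the text.

```python
SIGMA_MIN = 62.0  # floor added to the postregression standard deviation
R_MIN = 0.6       # lower bound on sigma_post relative to sigma_pre

def modified_sigma(sigma_post, sigma_pre, sigma_min=SIGMA_MIN, r_min=R_MIN):
    """Return sigma* after limiting sigma_post from below as in Equation (A-4)."""
    limited_post = max(sigma_post, r_min * sigma_pre)
    return limited_post + sigma_min

# Regression removed most of the variance: the limit applies.
print(modified_sigma(sigma_post=100.0, sigma_pre=300.0))

# Regression removed little variance: sigma_post is used unchanged.
print(modified_sigma(sigma_post=250.0, sigma_pre=300.0))
```

The sketch covers only the computation of the normalizing constant; the accompanying reduction of the regression coefficients mentioned in the text is not modeled here.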

4.  QUANTIZATION 

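The regressed and normalized amplitude vector is then quantized to one of eight levels
according to a set of quantization thresholds $\theta_{iq}$:

$$(a_{ij})_Q = q \quad \text{iff} \quad \begin{cases} (a_{ij})_N \ge \theta_{iq} \\ (a_{ij})_N < \theta_{i,q+1} \end{cases} \quad \text{for } q = 0, 1, \ldots, 7 \qquad \text{(A-6)}$$

where $\theta_{iq} < \theta_{i,q+1}$; $\theta_{i0} = -\infty$; and $\theta_{i8} = \infty$.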

Rather than have these quantization levels ($\theta_{iq}$) being chosen to yield a uniform
probability, however, it was more desirable to have the quantization thresholds cluster at higher
energy levels. In this way, the sensitivity to noise can be reduced and quantization resolution is
increased in the region of interest (which is the spectrum amplitude at the formant frequencies).
The actual procedure used to determine the quantization thresholds is described in more detail in
the total voice verification study final report;¹ however, the quantization thresholds used for
each of the 14 filter outputs are shown in Figure A-4 and those used for the two regression
coefficients $c_1$ and $c_2$ are shown below.


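          $\theta_0$   $\theta_1$   $\theta_2$   $\theta_3$   $\theta_4$   $\theta_5$   $\theta_6$   $\theta_7$   $\theta_8$
$c_1$     $-\infty$    -3.0         -1.5         0            1.5          3.0          4.5          6.0          $\infty$
$c_2$     $-\infty$    -7.0         -5.67        -4.33        -3           -1.67        -0.33        1.0          $\infty$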
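
The quantization rule of Equation (A-6) amounts to finding the interval that contains the normalized value. A minimal Python sketch is given below, using the standard-library `bisect` module; the function name is illustrative, and the threshold list is the set of finite thresholds for the first regression coefficient as read from the table above, with the signs taken so that the thresholds increase monotonically as the text requires.

```python
from bisect import bisect_right

# Finite thresholds theta_1 .. theta_7 for regression coefficient c1
# (theta_0 = -infinity and theta_8 = +infinity are implicit).
C1_THRESHOLDS = [-3.0, -1.5, 0.0, 1.5, 3.0, 4.5, 6.0]

def quantize(value, thresholds):
    """Return level q (0-7) such that theta_q <= value < theta_(q+1)."""
    return bisect_right(thresholds, value)

for sample in (-10.0, -3.0, 0.7, 5.9, 100.0):
    print(sample, quantize(sample, C1_THRESHOLDS))
```

`bisect_right` counts the thresholds that are less than or equal to the value, which is exactly the level index under the convention that a value equal to a threshold falls in the higher level.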

5.  ENERGY 


For each time sample, a measure of the energy was also computed. As an aid to
distinguishing vowels from nasals (which usually have most of their energy in $a_{1j}$) and vowels
from sibilants (which usually have most of their energy in $a_{14,j}$), these two filters were not used
in computing the energy measure in the following expression:
APPENDIX B

SCATTER  MATRICES 


Inherent in many criteria used in clustering is the concept of the scatter of the data,
numerically represented by scatter matrices. The within-class scatter matrix measures the distance
of the “n” sample vectors from their mean vectors and is the sum of the scatter matrices $S_i$ for all
“c” classes. The within-class scatter is given by

$$S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x \in X_i} (x - m_i)(x - m_i)^T \qquad \text{(B-1)}$$

The between-class scatter matrix measures the distance of the class mean vectors from the
overall sample mean and is given by

$$S_B = \sum_{i=1}^{c} n_i\,(m_i - m)(m_i - m)^T \qquad \text{(B-2)}$$

The total scatter matrix measures the distance of the samples from the overall mean as given by

$$S_T = \sum_{x \in X} (x - m)(x - m)^T \qquad \text{(B-3)}$$

Note that $S_T = S_W + S_B$ and, therefore, tr $S_T$ = tr $S_W$ + tr $S_B$; the determinants, by
contrast, are not additive in general.
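
The three scatter matrices and the additive relation among them can be illustrated with a small pure-Python sketch. The helper names and the two toy classes of two-dimensional points are assumptions made for the example; they are not data from this study.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def outer(u):
    """Outer product u u^T as a list of rows."""
    return [[a * b for b in u] for a in u]

def add(A, B, scale=1.0):
    """Return A + scale * B for two square matrices."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter_matrices(classes):
    """Return (S_W, S_B, S_T) for a list of classes, each a list of vectors."""
    d = len(classes[0][0])
    all_x = [x for c in classes for x in c]
    m = mean(all_x)
    S_W = [[0.0] * d for _ in range(d)]
    S_B = [[0.0] * d for _ in range(d)]
    S_T = [[0.0] * d for _ in range(d)]
    for c in classes:
        m_i = mean(c)
        for x in c:
            S_W = add(S_W, outer([a - b for a, b in zip(x, m_i)]))
        S_B = add(S_B, outer([a - b for a, b in zip(m_i, m)]), scale=len(c))
    for x in all_x:
        S_T = add(S_T, outer([a - b for a, b in zip(x, m)]))
    return S_W, S_B, S_T

def trace(A):
    return sum(A[k][k] for k in range(len(A)))

# Two hypothetical classes of 2-D sample vectors.
classes = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[4.0, 4.0], [5.0, 5.0]],
]
S_W, S_B, S_T = scatter_matrices(classes)
print(trace(S_W), trace(S_B), trace(S_T))
```

Comparing `S_T` entry by entry with `add(S_W, S_B)` confirms the decomposition for this example, and the traces add in the same way.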


Anderberg³² summarizes the four principal criteria that have emerged using scatter
matrices:

(1) Minimize tr $S_W$. This is identical to the sum-of-squared error criterion of Ward.

(2) Minimize the ratio $|S_W|/|S_T|$. This criterion is known as Wilks’ lambda statistic.
Equivalent criteria are minimizing $|S_W|$ or maximizing $|S_T|/|S_W|$ or $|I + S_W^{-1} S_B|$.

(3) Maximize the largest eigenvalue of $S_W^{-1} S_B$ (attributed to S.N. Roy).

(4) Maximize tr $S_W^{-1} S_B$ (Hotelling’s trace criterion).

The last three of these criteria involve the eigenvalues of $S_W^{-1} S_B$, which are invariant under
nonsingular linear transformations, measuring the ratio of the between-class to within-class scatter
in the direction of the eigenvectors. Duda and Hart,²⁸ however, note that invariant criterion
functions are more likely to possess multiple local extrema, and are correspondingly more
difficult to extremize.


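As noted, the first criterion is simply the sum-of-squared-error criterion.
Specifically,

$$ J_e = \mathrm{tr}\,S_W = \sum_{i=1}^{c} \mathrm{tr}\,S_i = \sum_{i=1}^{c} \sum_{x \in X_i} \|x - m_i\|^2 \tag{B-4} $$

where $S_i$ is the scatter matrix of class $i$. With some manipulation, this can be rewritten as

$$ J_e = \sum_{i=1}^{c} \frac{1}{2 n_i} \sum_{x \in X_i} \sum_{x' \in X_i} \|x - x'\|^2 \tag{B-5} $$

Minimization of this criterion is equivalent to maximizing

$$ \mathrm{tr}\,S_B = \sum_{i=1}^{c} n_i \|m_i - m\|^2 \tag{B-6} $$

since $\mathrm{tr}\,S_T$ is a constant. The term $\mathrm{tr}\,S_B$ can also be manipulated (see Appendix C) to give

$$ \mathrm{tr}\,S_B = \frac{1}{2n} \sum_{i=1}^{c} \sum_{j=1}^{c} n_i n_j \|m_i - m_j\|^2 \tag{B-7} $$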
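The equivalence between minimizing the sum-of-squared error and maximizing the between-class trace can be seen numerically. The sketch below is a plain-Python illustration on hypothetical data; the function names are assumptions. It evaluates the sum-of-squared error both about the class means and from pairwise distances within each class, and shows that the total trace does not depend on the partition.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def sse_about_means(classes):
    """Sum of squared distances of samples from their class means."""
    return sum(sq_dist(x, mean(X)) for X in classes for x in X)

def sse_pairwise(classes):
    """Same quantity from within-class pairwise squared distances."""
    return sum(sum(sq_dist(x, y) for x in X for y in X) / (2.0 * len(X))
               for X in classes)

def tr_between(classes):
    m = mean([x for X in classes for x in X])
    return sum(len(X) * sq_dist(mean(X), m) for X in classes)

def tr_total(classes):
    all_x = [x for X in classes for x in X]
    m = mean(all_x)
    return sum(sq_dist(x, m) for x in all_x)

# Two hypothetical partitions of the same seven points.
p = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0],
     [6.0, 5.0], [8.0, 5.0], [7.0, 8.0], [7.0, 6.0]]
partition_a = [p[:3], p[3:]]
partition_b = [p[:4], p[4:]]

for name, part in (("a", partition_a), ("b", partition_b)):
    print(name, sse_about_means(part), sse_pairwise(part),
          tr_between(part), tr_total(part))
```

Because the total trace is the same for both partitions, whichever partition has the smaller sum-of-squared error necessarily has the larger between-class trace.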


Although the criteria given above differ, the underlying model using scatter matrices is that
of $c$ fairly well separated clouds containing roughly equal numbers of points. Duda and
Hart,²⁸ however, demonstrate (Figure B-1) how this model can lead to poor results.
(Figure B-1 shows two partitions, (A) and (B), of the same data. The sum-of-squared error is
smaller for (A) than for (B). From Duda and Hart.²⁸)

Figure B-1. Problem of Splitting Large Clusters
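The effect named in the figure caption can be reproduced with a small one-dimensional sketch. The data below are hypothetical and chosen only to make the point: one large, spread-out cluster and one small, tight cluster. The sum-of-squared error favors splitting the large cluster over separating the two natural groups.

```python
def sse(partition):
    """Sum-of-squared error of a partition of 1-D points."""
    total = 0.0
    for cluster in partition:
        m = sum(cluster) / len(cluster)
        total += sum((x - m) ** 2 for x in cluster)
    return total

# Hypothetical data: 100 points spread over [0, 9.9] and 3 points near 14.
large = [k / 10 for k in range(100)]
small = [14.0, 14.1, 14.2]

natural = [large, small]                      # the two visually obvious groups
split = [large[:50], large[50:] + small]      # large cluster cut in half

print("natural partition SSE:", sse(natural))
print("split partition SSE:  ", sse(split))
```

The split partition has the lower error even though it merges the small cluster with half of the large one, which is the kind of poor result the equal-sized-clouds model can produce.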
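APPENDIX C

tr S_B ALTERNATE FORM DERIVATION


By definition,

$$ \mathrm{tr}\,S_B = \sum_{i=1}^{c} n_i \|m_i - m\|^2 = \sum_{i=1}^{c} n_i (m_i - m)^T (m_i - m) $$

Expanding,

$$ \mathrm{tr}\,S_B = \sum_{i=1}^{c} n_i m_i^T m_i - 2 \sum_{i=1}^{c} n_i m_i^T m + \sum_{i=1}^{c} n_i m^T m $$

But the sum in the middle term can be rewritten as

$$ \sum_{i=1}^{c} n_i m_i^T m = n\, m^T m = \sum_{i=1}^{c} n_i m^T m $$

yielding

$$ \mathrm{tr}\,S_B = \sum_{i=1}^{c} n_i m_i^T m_i - \sum_{i=1}^{c} n_i m^T m $$

Multiplying by $(2n/2n)$ and splitting the first term in half with appropriate change in indices,

$$ \mathrm{tr}\,S_B = \frac{1}{2n} \left[ n \sum_{i=1}^{c} n_i m_i^T m_i - 2 \sum_{i=1}^{c} n_i m_i^T \sum_{j=1}^{c} n_j m_j + n \sum_{j=1}^{c} n_j m_j^T m_j \right] $$

Expressing $n$ as the sum of the $n_j$'s (or $n_i$'s) and factoring out the summations and $n_i n_j$ yields

$$ \mathrm{tr}\,S_B = \frac{1}{2n} \sum_{i=1}^{c} \sum_{j=1}^{c} n_i n_j \left[ m_i^T m_i - 2\, m_i^T m_j + m_j^T m_j \right] $$

The bracketed term is $\|m_i - m_j\|^2$, which gives Equation (B-7).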
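The result of this derivation can be spot-checked numerically. The sketch below compares the defining sum over class means with the pairwise double sum; the class sizes, mean vectors, and function names are hypothetical values chosen for illustration.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def tr_sb_definition(sizes, means):
    """tr S_B as the sum of n_i * ||m_i - m||^2 over classes."""
    n = sum(sizes)
    d = len(means[0])
    m = [sum(n_i * m_i[k] for n_i, m_i in zip(sizes, means)) / n
         for k in range(d)]
    return sum(n_i * sq_dist(m_i, m) for n_i, m_i in zip(sizes, means))

def tr_sb_pairwise(sizes, means):
    """tr S_B as (1/2n) * sum over i, j of n_i * n_j * ||m_i - m_j||^2."""
    n = sum(sizes)
    total = sum(n_i * n_j * sq_dist(m_i, m_j)
                for n_i, m_i in zip(sizes, means)
                for n_j, m_j in zip(sizes, means))
    return total / (2.0 * n)

# Hypothetical class sizes and class mean vectors.
sizes = [3, 4, 5]
means = [[1.0, 1.0], [7.0, 6.0], [-2.0, 4.0]]

print(tr_sb_definition(sizes, means), tr_sb_pairwise(sizes, means))
```

The pairwise form needs only the class means and sizes, not the overall mean, which can be convenient when clusters are merged or split one pair at a time.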
APPENDIX D

POST ITERATIVE OPTIMIZATION STATISTICS FOR RECOGNITION PATTERNS

STATISTICS FOR DIGIT: 0; REF PT: 0; NO OF DATA PTS: 166
MINAVE AND MINMAX AGGLOM CLUSTERING; 16 MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE
NO  OF

ITtkS 

Jt  ( = 

TK(w)  ) 

Jfc  (d- 

JElC  + 1) 
/JE 1C) 

Tk(6)/T*(a) 

A V t 

MAX 

Ay/E 

MAX 

AVE 

MAX 

A1/E 

MAX 

1 

0 

0 

59bb2 . 7 

59562.7 
0.146

0 . 0 0 0 

0.000 

2 

75 

51 

59320.9 

59345.7 

0.056 

0.037 

0.174 

0.174 

3 

97 

34 

55667 . 7 

57  150.5 

0.0  34 

0.053 
0.219

4 
42

539b9 . 2 

541  38.4 

0.029 

0.040 
1 34

40 

S2407 . 1 

5196b. 6 
0.037

0.329 

0.34  1 

8 

1 1 3 

50 

51245.5 

50045.1 

0 . 0 4 7 

0 . 0 1 9 

0.  359 

0.392 

7 

151 

bO 

48820.9 

491  12.9 

0.012 

0 . 0 1 7 

0.427 

0.418 

8 

1 3b 

52 

48248.7 

48258.7 

0.016 

0.020 

0.444 

0.444 

9 

1 lb 

47 

47489.5 

47  30  7. 3 

0.02  7 

0.027 

0 . 4b  7 

0.473 

10 

122 

85 

45223.7 

45051.3******* 

******* 

0.507 

0.513 

C 

C (N-C)  TW  16) 

/2N  (C-l)  TW(a') 

oTL'S 

SIGMA 

f N-C ) *DEL Jt 
/Jt  (C) 

(\I-C)*Tk(6) 

/IC-1)*TW(«) 

AVE 

MAX 

Ay/E 

MAX 

A 0 1 

MAX 

AVE 

MAX 

1 0.000  0.000  0.000  0.000  24.495  24.03b  0.000  0.000 

2 0.172  0.172  0.343  0.543  9.547  b.Obb  24.591  28.511 

3 0.1tt2  0.161  0.372  0.  519  5.539  8.591  20.1c;4  17.843 

4 0.189  0.187  0.401  0.559  4.589  b.49b  15.702  15.485 

5 0.200  0 . 2 0 b 0.43b  0.4  55  5.558  5.954  1 3.253  1 3.70b 

o 0.208  0.227  0.471  0.490  7.570  2.980  11.500  12.544 

7 0.239  0.234  0.501  0.49b  1.864  2.7b5  11.315  11.088 

8 0.241  0.241  0.524  0.502  2.46b  3.115  10.018  10.011 

9 0.248  0.251  0.583  0.b4t  4.185  4.188  9.1b3  9.274 

10  0.2b 5 0.258  0.509  0.504  **************  8.769  8.867 

STATISTICS FOR DIGIT: 1; REF PT: 0; NO OF DATA PTS: 168
MINAVE AND MINMAX AGGLOM CLUSTERING; 16 MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE

                                          JE(C)-JE(C+1)
       NO OF ITERS       JE (=TR(W))         /JE(C)          TR(B)/TR(W)
  C     AVE    MAX      AVE       MAX       AVE      MAX      AVE     MAX

  1       0      0    70679.1   70679.1    0.137    0.136    0.000   0.000
  2      19     33    60969.0   60914.4    0.064    0.077    0.159   0.160
  3      73     58    57069.9   56197.6    0.078    0.063    0.238   0.258
  4      86     57    52645.4   52655.6    0.031    0.037    0.343   0.342
  5     132     75    50998.3   50683.9    0.033    0.026    0.365   0.395
  6     135     74    49329.7   49352.1    0.033    0.039    0.433   0.432
  7     155     61    47704.8   47402.8    0.031    0.023    0.462   0.491
  8     112     91    46239.0   46291.0    0.025    0.026    0.529   0.527
  9     132     93    45093.1   44985.8    0.023    0.017    0.567   0.571
 10     156     83    44062.9   44201.1  *******  *******    0.604   0.599

C (N-C) Tk (8) 

(N-C) *U£LJE 

( x-O* 

Ifi  (ti) 

C 

/ 2 n ( C - 

1 )1P  (iH) 

«TL’S 

SIGMA 

/je  (d 

/ (C- 1 ) * T 9 («) 

AVE  MAX  AVE 

1 0.000  0.000  0.000 

2 0.157  0.156  0.517 

3 0.175  0.190  0.359 

4 0.223  0.223  0.4b4 

5 0.234  0.239  0.495 

b 0.250  0.250  0.533 

7 0.259  0.275  0.564 

8 0.268  0.287  0.590 

9 0.302  0.304  0.521 

10  0.31b  0.313  0.530 


MAX 

A Vfc 

MAX 

AVE 

MAX 

0.000 

22.94  3 

23.072 

0.0  00 

0.000 

0.316 

1 0 .5  1 5 

12.554 

26.438 

2b. 51 0 

0.379 

12.792 

10.41/0 

19.573 

21.259 

0.506 

5.131 

5.141 

16.72b 

18.712 

0.520 

5.333 

4.263 

15.726 

15.075 

0.535 

5.33b 

5.399 

14.022 

14.001 

0.550 

4.947 

3.776 

12.923 

13.176 

0.557 

3 . 9b5 

4.51  1 

12.061 

12.042 

0.625 

3.533 

2.773 

11.277 

11.351 

0.674 

*******1 

[ *****  * 

1 0.604 

10.51b 

STATISTICS FOR DIGIT: 2; REF PT: 0; NO OF DATA PTS: 1 bn
MINAVE AND MINMAX AGGLOM CLUSTERING; 16 MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE


c 

MO  OF 

ITEMS 

Jt  ( = 

TM (A) ) 

Jt  CO- 

JE  C C •*- 1 ) 
/J  ECC) 

TK (b) / T R l a ) 

AVE 

MAX 

AVE 

MAX 

AVE 

MAX 

AVE 

MAX 

1 

0 

0 

51525.5 

51525.5 

0.125 

0.  124 

0.000 

0.000 

2 

72 

48 

45108.0 

45140.0 

O.Ubl 

0.060 

0. 142 

0.141 

3 

105 

48 

42371.1 

42424. b 

0.047 

0 . 030 

0.21b 

0.21b 

4 

140 

SO 

40377. b 

41131.9 

0.027 

0.049 

0.27b 

U.253 

5 

132 

82 

39289.9 

39098.7 

0.0  34 

0.033 

0.312 

0.316 

b 

92 

124 

37929.8 

37788.9 

0.025 

0.027 

0.358 

0.364 

7 

lib 

121 

38991 . 1 

3677  /.  1 

0 . o3  1 

0.031 

0.39  3 

0.40  1 

8 

1 34 

133 

358b  1.8 

35b40 . 8 

O.ol5 

0.015 

0.437 

0.44b 

9 

102 

113 

35323.3 

35122.3 

0.032 

0.014 

0.459 

0 . 4b7 

10 

138 

83 

34182.5 

3461  3.7 

************** 

0 . 6 0 7 

0.489 

CCN-C)TR(d) 

(N-C ) *UtL JE 

l g-C) * I k (6) 

C 

/2N(C- 

l)TR(w) 

6TL  ' S 

SIGMA 

/JE (Cl 

/(C-l) 

* T K ( A ) 

AVE 

MAX 

AVE 

MAX 

AVE 

MAX 

AVt 

MAX 

1 

0.000 

0.000 

0.000 

0.000 

2o.  675 

20.572 

0 .000 

0.00  0 

d 

0.140 

0.139 

0.285 

0.284 

1 0 . 0 1 1 

9.926 

23.474 

23.341 

i 

0. 15H 

0.157 

0.328 

0.329 

7.71b 

4.997 

17.716 

17.591 

4 

0.179 

0.1b3 

0.38  3 

0.437 

4.472 

8.057 

15.001 

1 3 . 7 2 9 

5 

0.188 

0.192 

0.  385 

0.486 

5.628 

5.427 

12.640 

1 2.672 

b 

0.20b 

0.209 

0.49b 

0.«94 

3.986 

4.311 

1 1.542 

11.705 

7 

0.218 

0.223 

0.545 

0.499 

4.885 

4.943 

10.478 

1 0 . o94 

8 

0.23b 

0.241 

0.549 

0.532 

2.388 

2.313 

9.921 

10.123 

9 

0.243 

0.247 

0.548 

0.539 

5.102 

2.288 

9.059 

9.224 

10 

0 . 2b3 

0.254 

0 . b08 

0.62b 

******* 

******* 

8.851 

8.523 

STATISTICS FOR DIGIT: 3; REF PT: 0; NO OF DATA PTS: 169
MINAVE AND MINMAX AGGLOM CLUSTERING; 16 MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE

                                          JE(C)-JE(C+1)
       NO OF ITERS       JE (=TR(W))         /JE(C)          TR(B)/TR(W)
  C     AVE    MAX      AVE       MAX       AVE      MAX      AVE     MAX

  1       0      0    54075.5   54075.5    0.118    0.118    0.000   0.000
  2      70     51    47673.7   47698.1    0.056    0.056    0.134   0.134
  3      98     70    44988.2   45016.6    0.042    0.045    0.202   0.201
  4     113     49    43089.4   43007.2    0.050    0.040    0.255   0.257
  5     118     87    40949.0   41295.5    0.035    0.038    0.321   0.309
  6      90     66    39523.0   39708.2    0.017    0.042    0.368   0.362
  7      76     66    38841.5   38043.4    0.033    0.031    0.392   0.421
  8     106     58    37565.5   36879.2    0.020    0.024    0.439   0.466
  9     112     82    36810.8   36009.?    0.031    0.021    0.469   0.502
 10     127     87    35675.6   35267.3  *******  *******    0.516   0.533

       C(N-C)TR(B)                       (N-C)*DELJE       (N-C)*TR(B)
      /2N(C-1)TR(W)     BTL'S SIGMA         /JE(C)        /(C-1)*TR(W)
  C     AVE     MAX     AVE     MAX      AVE      MAX      AVE      MAX

  1    0.000   0.000   0.000   0.000   19.889   19.813    0.000    0.000
  2    0.133   0.132   0.270   0.270    9.407    9.366   22.425   22.328
  3    0.149   0.148   0.302   0.295    7.006    7.410   16.765   16.702
  4    0.166   0.168   0.385   0.348    8.196    6.567   14.023   14.155
  5    0.194   0.188   0.392   0.382    5.709    6.304   13.143   12.688
  6    0.213   0.209   0.458   0.435    2.813    6.834   12.003   11.795
  7    0.219   0.236   0.493   0.490    5.322    4.957   10.590   11.378
  8    0.239   0.254   0.566   0.547    3.234    3.798   10.103   10.725
  9    0.250   0.267   0.594   0.550    4.934    3.297    9.380   10.034
 10    0.270   0.279   0.639   0.579  *******  *******    9.112    9.422
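Several columns of these printouts appear to be simple functions of the JE column. The sketch below recomputes them for the digit 3 AVE results under two assumptions that are inferred from the column headings rather than stated in the report: that TR(B) equals JE(1) minus JE(C), since the total trace is fixed, and that DELJE/JE(C) is the relative decrease (JE(C)-JE(C+1))/JE(C). The function names are hypothetical, and the JE values are as read from the scanned printout.

```python
# JE (=TR(W)) for C = 1..10, AVE column, digit 3, as read from the printout.
JE = [54075.5, 47673.7, 44988.2, 43089.4, 40949.0,
      39523.0, 38841.5, 37565.5, 36810.8, 35675.6]
N = 169  # number of data points given in the digit 3 heading

def relative_decrease(je, c):
    """(JE(C) - JE(C+1)) / JE(C); c is the 1-based number of clusters."""
    return (je[c - 1] - je[c]) / je[c - 1]

def between_to_within(je, c):
    """TR(B)/TR(W), assuming TR(B) = JE(1) - JE(C)."""
    return (je[0] - je[c - 1]) / je[c - 1]

def scaled_decrease(je, c, n):
    """(N-C) * DELJE / JE(C)."""
    return (n - c) * relative_decrease(je, c)

def variance_ratio(je, c, n):
    """(N-C) * TR(B) / ((C-1) * TR(W)); defined for c >= 2."""
    return (n - c) * between_to_within(je, c) / (c - 1)

def normalized_ratio(je, c, n):
    """C * (N-C) * TR(B) / (2N * (C-1) * TR(W)); defined for c >= 2."""
    return c * (n - c) * between_to_within(je, c) / (2.0 * n * (c - 1))

for c in range(2, 10):
    print(c,
          round(relative_decrease(JE, c), 3),
          round(between_to_within(JE, c), 3),
          round(normalized_ratio(JE, c, N), 3),
          round(scaled_decrease(JE, c, N), 3),
          round(variance_ratio(JE, c, N), 3))
```

Under these assumptions the recomputed values agree with the printed AVE entries for digit 3 to about the third decimal place, which also gives a way to resolve characters that are illegible in the scan. The BTL'S SIGMA column is not reproduced because its definition is not given in this section.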
STATISTICS FOR DIGIT: 4; REF PT: 0; NO OF DATA PTS: 167
MINAVE AND MINMAX AGGLOM CLUSTERING; 20MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE


NO 

UF 

JEIC)- 

J t l C ♦ 1 ) 

c 

ITtPS 

Jfc  ( = 

TP  f A)  ) 

/Jt(C) 

TP (B) /TP (w) 

AVE 

MAX 

A VE 

MAX 

AVt 

MAX 

AVt 

MAX 

1 

0 

0 

43604.4 

43604.4 

0. 127 

U.  127 

0.000 

0.000 

<2 

99 

32 

38061.0 

3 60  75.6 

0.054 

0.053 

0.146 

0.145 

3 

64 

84 

35993.5 

36052.4 

0.04  0 

0.03b 

0.211 

0.209 

4 

133 

59 

3453o.4 

34  7 38.5 

0.029 

0.040 

0.2b  3 

0.255 

5 

124 

123 

33525.7 

33340.6 

0.034 

0.039 

0.301 

0.306 

b 

113 

107 

32389.8 

32054.9 

0.041 

0.031 

0.346 

0.360 

7 

128 

lib 

31074.0 

3105b. 6 

0.015 

0.025 

0.403 

0.404 

8 

141 

63 

30b  1 7 . 9 

30289.4 

0.015 

0.035 

0.424 

0.440 

9 

149 

129 

30165.9 

29229.9 

0.034 

0.011 

0.445 

0.492 

10 

200 

115 

29153.2 

28699.5* 

****** 

******* 

0.49b 

0.509 

C CN-C) TP18) 

(N-C) *DtLJE 

(N-C)*Tp(6) 

C 

/2N(C- 

1)  TPU) 

b TL  ' S 

SIGMA 

/J E(C) 

/ IC-l ) *TW (w) 

AVE 

MA  X 

AVt 

MAX 

AVE 

MAX 

AVE 

MAX 

1 

0.000 

0.000 

0.000 

0.000 

20.468 

20.413 

0.000 

o.ooo 

a 

0.140 

0.139 

0.29  3 

0.295 

8.691 

6.503 

23.305 

23.232 

3 

0.151 

0.150 

0.375 

0. 3b 5 

6.437 

5.795 

16.610 

16.653 

4 

0.  lbb 

0.  lbl 

0.353 

0.365 

4.623 

b.357 

13.826 

13.442 

5 

0.177 

0.181 

0 . 4b  1 

0.452 

5.320 

6 . o 55 

1 1.800 

12.083 

b 

0.194 

0.202 

0.473 

0 . b32 

h.337 

4.658 

10.803 

1 1.241 

7 

0.216 

0.219 

0.541 

0.527 

2.275 

3.829 

10.417 

10.437 

» 

0.224 

0.232 

0.563 

0.539 

2.273 

5.367 

9. 331 

9 . b7  1 

9 

0.230 

0.253 

1.044 

0.590 

5.137 

1.730 

6.520 

9.405 

to 

0.251 

0.257 

0.845 

0 . b 1 1 * 

****** 

******* 

8.372 

8,594 

STATISTICS FOR DIGIT: 5; REF PT: 0; NO OF DATA PTS: 169
MINAVE AND MINMAX AGGLOM CLUSTERING; 20MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE

                                          JE(C)-JE(C+1)
       NO OF ITERS       JE (=TR(W))         /JE(C)          TR(B)/TR(W)
  C     AVE    MAX      AVE       MAX       AVE      MAX      AVE     MAX

  1       0      0    87861.3   87861.3    0.164    0.164    0.000   0.000
  2      15     19    73448.7   73443.4    0.041    0.053    0.196   0.196
  3      52     56    70442.7   69540.9    0.050    0.047    0.247   0.263
  4      85     45    66955.3   66296.3    0.045    0.032    0.312   0.325
  5     106     46    63947.1   64185.0    0.028    0.026    0.374   0.369
  6     127     43    62144.9   62522.0    0.025    0.029    0.414   0.405
  7     126     59    60578.4   60714.5    0.034    0.025    0.450   0.447
  8     150     66    58523.0   59195.0    0.024    0.030    0.501   0.464
  9     167     75    57107.8   57430.3    0.021    0.023    0.539   0.530
 10     176     64    55926.8   56100.6  *******  *******    0.571   0.566

C (N-C ) TP (b) 

(N-C) *D£LJE 

IN-C) *TW  (b) 

C 

/an (c- 

1)1P(A) 

bTL'S 

SIGMA 

/JE 1C) 

/(C-l)*TP(w) 

AVE 

MAX 

AVt 

MAX 

AVE 

MAX 

AVE 

MAX 

1 

0.000 

0.000 

0.000 

0.000 

27.394 

27.404 

0.000 

0.000 

2 

0.193 

0.193 

0.390 

0.390 

b.  794 

6.820 

32.5/4 

32.588 

3 

0.161 

0.193 

0.33b 

0.368 

8.169 

7.699 

20.400 

21.734 

4 

0.202 

0.210 

0.38  3 

0 . 403 

7.368 

5.223 

17.069 

17.782 

5 

0.225 

0.222 

0.441 

0.414 

4.694 

4.223 

15.239 

15.032 

6 

0.236 

0.233 

0. 4b7 

0.438 

4.084 

4.b83 

13.408 

13.131 

7 

0.250 

0.248 

0.509 

0.478 

5.4b3 

4.030 

1 2 . 0 b 5 

1 1.998 

b 

0.271 

0.262 

0.540 

0.542 

3.869 

4.770 

11.459 

11.069 

9 

0.285 

0.280 

0.561 

0.577 

3.288 

3.661 

10.703 

10.531 

10 

0.297 

0.294 

O.bl  1 

10.024 

9.939 
STATISTICS FOR DIGIT: 6; REF PT: 0; NO OF DATA PTS: 1 c, 7

MINAVE AND MINMAX AGGLOM CLUSTERING; 20MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE


c 

NO  OF 

ITEkb 

JE  ( = 

Tk  (aJ  ) 

Jt  IC)  - 

Je IC+ll 
/Jt(Cl 

r k ( h ) / r k ( a ) 

AVt 

MAX 

AVt 

MAX 

AVE 

MAX 

AVE 

MAX 

1 

0 

0 

b84 12.0 

58412.0 

o.17b 

0.17b 

o . o o o 

0.000 

2 

64 

47 

48111.0 

4Hl3o.5 

0 . 0 5 1 

0.081 

0.214 

0.213 

3 

84 

27 

4 5bb  9 . 2 

45o8b . 7 

0.0  2 / 

0.055 

0.279 

0.279 

4 

bo 

4 2 

44419.0 

431  7 7.7 

0.059 

0 . 022 

0.513 

0.353 

5 

88 

42 

41809.1 

4 22  4b  . 4 

o . 025 

0.0  30 

0.39  7 

0.383 

b 

100 

47 

40753.9 

4098b. 8 

0.035 

0.019 

0.453 

0.425 

7 

133 

49 

3932b . b 

40228.1 

0.017 

0 . 0 3b 

0.485 

0.452 

8 

1 3u 

90 

38659 . 4 

38789.0 

0.028 

0.003 

0.511 

0 .50b 

9 

lbO 

32 

37594.0 

3 8b  54 . 9 

0.01b 

0 . 0 2 1 

0 . 554 

0.511 

10 

14b 

bO 

3b99b.O 

3 7 8 2b . 5 < 

r * * * * * * 

******* 

0.379 

0.544 

C(N-C) T«CB) 

(N-C) * D t L J E 

( V-L  ) * Tk  (aj 

C 

/2N(C- 

l)Tk(rt) 

BTL'S 

SIGMA 

/JE  1C) 

/(C-l) 

* f k U) 

AVE 

MAX 

AVE 

max 

A Vfc 

MAX 

AVt 

MAX 

1 

0.000 

0.000 

0.000 

0.000 

28.921 

2e . b5u 

0 .000 

0 .000 

2 

0.209 

0.208 

0.422 

0.422 

8.27  3 

b . 29o 

34.900 

34 .795 

3 

0.203 

0.203 

0.430 

0.42b 

4.435 

8 . b 9 7 

22.bin 

22.561 

4 

0 , 2u2 

0.227 

0.37  1 

0.518 

9 . 4bO 

3.473 

18.906 

18.985 

5 

0.238 

0.229 

0.475 

0.530 

4 . U 3 8 

4.771 

1 5 . m a 4 

1 30b 

b 

0.248 

0.243 

0.519 

0 . 48b 

5 . 5b9 

2.943 

13.779 

1 3.520 

7 

0 , 2 b 8 

0.249 

O.SbS 

0.54b 

2 . b 8 0 

5 . b52 

12.780 

11.903 

8 

0.274 

0.272 

0.572 

O.bOl 

4.327 

0.543 

1 1 .4bO 

11.34b 

9 

0.291 

0.269 

0.590 

O.bOl 

2.48  1 

3.343 

10.798 

9 . 9 o 7 

10 

0.298 

0.281 

0.608 

0.633 

******** 

r *****  * 

9 . 9b 9 

9.372 

STATISTICS FOR DIGIT: 7; REF PT: 0; NO OF DATA PTS: 168

MINAVE AND MINMAX AGGLOM CLUSTERING; 20MAR79

POST ITERATIVE OPTIMIZATION FOR MIN JE


NO OF        JE(C)-JE(C+1)
MISSION

of

Rome Air Development Center

RADC plans and executes research, development, test and
selected acquisition programs in support of Command, Control,
Communications and Intelligence (C³I) activities. Technical
and engineering support within areas of technical competence
is provided to ESD Program Offices (POs) and other ESD
elements. The principal technical mission areas are
communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence data
collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.


